Abstract: We consider the variable selection problem in linear regression
using mixtures of g-priors. A number of mixtures have been proposed in the
literature that work well, especially when the number of regressors $p$
is fixed. In this paper, we propose a mixture of g-priors suitable for the
case when $p$ grows with the sample size $n$. We study the performance of
the method based on the proposed prior when $p = O(n^b)$, $0 < b < 1$.
Along with model selection consistency, we also investigate the performance
of the proposed prior when the true model does not belong to the
model space considered. We find conditions under which the proposed
prior is consistent in an appropriate sense when normal linear models are
considered. Further, we consider the case with non-normal errors in the
regression model and study the performance of the model selection procedure.
We also compare the performance of the proposed prior with that
of several other mixtures available in the literature, both theoretically
and using simulated data sets.

Keywords and phrases: Model false case, model selection consistency, 
model true case, normal linear models, scaled inverse chi-square prior.


1. Introduction 

We consider the regression problem with a response variable $y$ and a set of
$p$ potential regressors $x_1, x_2, \ldots, x_p$. Let $y_n = (y_1, y_2, \ldots, y_n)$ be a set of
$n$ observations on $y$, and $X_n = (x_1, x_2, \ldots, x_p)$ be the $n \times p$ design matrix,
where $x_i$ is the vector of $n$ observations on the $i$th regressor $x_i$, $i = 1, 2, \ldots, p$.
We write

$$ y_n = \mu_n + e_n, \tag{1.1} $$

where $\mu_n = E(y_n | X_n)$ is the regression of $y_n$ on $X_n$ and $e_n$ is the vector
of random errors. If we assume the normal linear regression model, then
$\mu_n = \beta_0 1 + X_n \beta$ and $e_n \sim N_n(0, \sigma^2 I)$. Here $\beta_0$ is the intercept, $1$ and $0$ are
the $n \times 1$ vectors of ones and zeros, respectively, and $\beta = (\beta_1, \beta_2, \ldots, \beta_p)$ is
the vector of regression coefficients.

In this article, we study the variable selection problem. Given a set of $p$
available regressor variables, there are $2^p$ possible linear regression models
depending on which regressors are included in the model. The space of all
these models is denoted by $\mathcal{A}$ and indexed by $\alpha$, where each $\alpha$ consists of a
subset of size $p(\alpha)$ ($0 \le p(\alpha) \le p$) of the set $\{1, 2, \ldots, p\}$, indicating which
regressors are selected in the model. The $\alpha$th model $M_\alpha$ is stated as

$$ M_\alpha : \mu_n = \beta_0 1 + X_\alpha \beta_\alpha, \tag{1.2} $$

where $X_\alpha$ is a sub-matrix of $X_n$ consisting of the $p(\alpha)$ columns specified by
$\alpha$ and $\beta_\alpha$ is the corresponding vector of regression coefficients. We assume
that all the components of $\beta_\alpha$ are non-zero. This ensures there is at most one
true model in the model space. Our purpose is to choose the model $\alpha \in \mathcal{A}$
that best explains the data.
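
To make the size of the model space concrete, the following minimal Python sketch enumerates every index set $\alpha$ for a small $p$. The function name `enumerate_models` and the use of tuples of 1-based labels are illustrative choices made here, not notation from the paper.

```python
from itertools import combinations

def enumerate_models(p):
    """Yield every model index alpha as a tuple of 1-based regressor labels."""
    labels = range(1, p + 1)
    for size in range(p + 1):              # size plays the role of p(alpha)
        for alpha in combinations(labels, size):
            yield alpha

models = list(enumerate_models(3))
print(len(models))                         # number of candidate models, 2**3
print(models[0], models[-1])               # the null model and the full model
```

Exhaustive enumeration of this kind is only feasible for small $p$, since the number of models doubles with every additional regressor.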

In a Bayesian approach, each model $M_\alpha$ is associated with a prior probability
$p(M_\alpha)$, and the corresponding set of parameters $\theta_\alpha = (\beta_0, \beta_\alpha, \sigma^2)$ involved
in the model is also associated with a prior distribution $p(\theta_\alpha | M_\alpha)$. Given the
priors, one computes the posterior probability of $M_\alpha$ as

$$ p(M_\alpha | y_n) = \frac{p(M_\alpha) \, m_\alpha(y_n)}{\sum_{\alpha' \in \mathcal{A}} p(M_{\alpha'}) \, m_{\alpha'}(y_n)}, \tag{1.3} $$

where

$$ m_\alpha(y_n) = \int p(y_n | \theta_\alpha, M_\alpha) \, p(\theta_\alpha | M_\alpha) \, d\theta_\alpha \tag{1.4} $$

is the marginal density of $y_n$ under $M_\alpha$ and $p(y_n | \theta_\alpha, M_\alpha)$ is the density of $y_n$
given $\theta_\alpha$ under $M_\alpha$. In our search for a model, $p(y_n | \theta_\alpha, M_\alpha)$ will be taken to
be normal. We consider the model selection procedure that selects the model
in $\mathcal{A}$ with the highest posterior probability.
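
The selection rule based on (1.3) can be sketched in a few lines of Python. The sketch below assumes that the log marginal densities have already been computed by some other means. The function names, the dictionary layout and the numbers in the example are hypothetical and serve only as an illustration. Working on the log scale avoids numerical underflow, since marginal densities are typically extremely small.

```python
import math

def posterior_model_probs(log_marginals, log_priors=None):
    """Normalise p(M) * m(y) over the model space, as in (1.3).

    Both arguments are dicts keyed by a model label with log-scale values.
    If log_priors is None, a uniform prior over the models is assumed.
    """
    keys = list(log_marginals)
    if log_priors is None:
        log_priors = {k: 0.0 for k in keys}
    scores = {k: log_marginals[k] + log_priors[k] for k in keys}
    top = max(scores.values())             # subtract the maximum for stability
    total = sum(math.exp(s - top) for s in scores.values())
    return {k: math.exp(s - top) / total for k, s in scores.items()}

def highest_posterior_model(log_marginals, log_priors=None):
    """Return the label of the model with the highest posterior probability."""
    probs = posterior_model_probs(log_marginals, log_priors)
    return max(probs, key=probs.get)

# Made-up log marginals for three candidate models.
example = {"null": -1050.0, "x1": -1047.5, "x1,x2": -1049.0}
print(highest_posterior_model(example))
```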

We consider a prevalent conventional prior on $\beta_\alpha$, the g-prior due to Zellner
(1986). Properties of the g-prior have been studied extensively in the literature (see, e.g.,
George and Foster (2000), Berger and Pericchi (2001), Fernandez, Ley and
Steel (2001)). The prior specification induced by the g-prior method crucially
depends on the choice of the hyperparameter $g$ (see, e.g., Berger and Pericchi
(2001), Liang et al. (2008)). It has also been argued in the literature that
this method is subject to inconsistencies such as the Bartlett paradox (see Bartlett
(1957), Jeffreys (1961)) and the information paradox (see Zellner (1986), Berger
and Pericchi (2001)). Liang et al. (2008) considered a mixture on $g$ instead
of a fixed $g$ to overcome these inconsistencies. Subsequently, a
number of mixtures on $g$ have been proposed in the literature. In this paper, we
propose a mixture $\pi(g)$ on $g$, suitable for the case when $p$ grows with $n$.

We assume without loss of generality that the columns of $X_\alpha$ are centered,
so that $1' x_i = 0$ for all $i$. The intercept $\beta_0$ and the scale parameter $\sigma^2$ are
common to all models and assumed to be independent of other parameters.
We use standard non-informative priors for $\beta_0$ and $\sigma^2$. For justification of
using such priors, see Bayarri et al. (2012, Sec 3.3). The vector $\beta_\alpha$ is assumed
to follow a normal distribution with location parameter $0$ and scale
$g\sigma^2 (X_\alpha' X_\alpha)^{-1}$, where the hyperparameter $g$ has density $\pi(g)$. The complete
prior specification is given by

$$ p(\beta_0, \sigma^2 | M_\alpha) = 1/\sigma^2, \qquad \beta_\alpha | (\beta_0, \sigma^2, g, M_\alpha) \sim N_{p(\alpha)}\left(0, \, g\sigma^2 (X_\alpha' X_\alpha)^{-1}\right) \tag{1.5} $$

and $g \sim \pi(g)$.

We do not consider any specific prior probability on the model space. We 
only impose some conditions on model prior probabilities under which our 
results hold. A similar setup has been considered by many authors; see, e.g.,
Liang et al. (2008), Bayarri et al. (2012). 

Among the existing mixtures of g-priors, the earliest one, to the best of
our knowledge, is due to Zellner and Siow (1980), who proposed a Cauchy
prior on $\beta_\alpha$. Since the Cauchy is an inverse gamma scale mixture of normal
distributions, their proposal is considered a mixture of g-priors.
Other priors include the hyper-g and hyper-g/n priors proposed by Liang et al.
(2008), the generalized g-prior of Maruyama and George (2011) and the robust prior
proposed by Bayarri et al. (2012). Henceforth, we will refer to these priors
as the Zellner-Siow prior, hyper-g or hyper-g/n prior, generalized g-prior and
robust prior, respectively.

In spite of the existence of the above mixtures of g-priors, there is not much discussion
in the literature on which mixture one should use in a given situation.
Bayarri et al. (2012) described some desirable properties a prior should satisfy
in the context of Bayesian model selection. Ley and Steel (2012) conducted an
extensive simulation study to compare several priors. However, none of them
considered the case when $p$ increases with $n$. Maruyama and George (2011)
proposed a prior which is applicable when $p > n$, but proved consistency of
their method for the case when $p$ is fixed. Shang and Clayton (2011) proved
consistency for mixtures of g-priors when $p$ increases with $n$, but their setup
differs from the usual g-prior setup with respect to the covariance structure of
the prior distribution of $\beta_\alpha$. Wang and Sun (2014) investigated properties of
different mixtures for the case with a growing number of regressors. However,
they only established results for Bayes factor consistency.

In this paper, we consider a scaled inverse chi-square prior on $g$ with appropriate
parameters. The Zellner-Siow prior belongs to this family. In a sense
we consider a modified form of the Zellner-Siow prior by choosing an appropriate
scale parameter. An advantage of this prior is that it provides an approximation
to the marginal density in (1.4) with a closed-form expression, which
facilitates easy implementation and theoretical studies. Further, it satisfies
many attractive consistency properties when $p$ increases with $n$. Most of the
existing mixtures fail to be consistent for such rates of increase of $p$. The
good properties of this prior are not restricted to the case where the error
distribution is normal. For a general class of error distributions with minimal
assumptions, the proposed prior performs reasonably well in terms of
consistency. In Section 2, we explicitly describe the form of the prior and the
motivation for considering it.

In our investigation, we assume that the number of regressors $p$ increases
with $n$ at a rate $p = O(n^b)$, $0 < b < 1$ and $p < n$. This is the so-called
‘large p large n regime’ and is of theoretical interest in contemporary research
(see, e.g., Fan and Peng (2004), Moreno, Giron and Casella (2010), Sparks,
Khare and Ghosh (2012), Johnson and Rossell (2012)). In practice, one can
conveniently use methods applicable in the ‘large p large n regime’ when there is
a sizable number of regressors compared to $n$ and the number of competing
models in the model space is significantly large compared to $n$.

We show that our proposed prior is consistent in an appropriate sense for a
large class of models under reasonable assumptions. We consider the following
two cases separately. First, we consider the case when the true model belongs
to the model space $\mathcal{A}$, i.e., the true regression $\mu_n$ is as in (1.2) for some $\alpha$.
This case is referred to as the ‘model true’ case. A well-known notion of
consistency in this regard is model selection consistency, which requires the
posterior probability of the true model to go to one as $n \to \infty$. We examine
model selection consistency of the proposed mixture on $g$, along with that of
some other existing mixtures. We then consider the case when $\mu_n$ can be any
unknown vector, not necessarily in the span of $\{1, x_1, x_2, \ldots, x_p\}$. This case
is referred to as the ‘model false’ case. The ‘model false’ case has been previously
studied in Shao (1997), Chakrabarti and Ghosh (2006), Chakrabarti and
Samanta (2008) and Mukhopadhyay, Samanta and Chakrabarti (2014). We
investigate consistency of the proposed prior in this case using an appropriate
notion of consistency.

The presence of the information paradox in Zellner’s g-prior remains one of
the key motivations for considering mixtures of g-priors. So, it is important
to verify whether the proposed mixture can resolve the information paradox,
i.e., is information consistent in the sense of Bayarri et al. (2012). Along with
the above notions of consistency, we study information consistency as well.

Sections 3 and 4 of the paper deal with model selection consistency. In
Section 3.1, we first consider the case when the error distribution is normal.
In Section 3.2, we relax the condition of normality on $e_n$. Here the $e_i$s are assumed
to be i.i.d. with mean 0 and finite fourth moment. In Section 4, we
consider the performance of some other mixtures on $g$ with respect to model
selection consistency in the normal linear model setup. Section 5 deals with the
‘model false’ case. We find sufficient conditions under which the proposed
prior is consistent in an appropriate sense under general error distributions.
In Section 6, we consider information consistency. In Section 7, we validate
the performance of the proposed prior with extensive simulation studies. We
study the performance of the proposed prior in comparison with several other
existing priors in the literature. Finally, we make some concluding remarks
in Section 8. Proofs of all the main results are presented in the Appendix
(Section 9).

2. Scaled inverse chi-square mixture of g-priors 

We first motivate our proposal of a mixture on $g$. Most of the mixtures on $g$
in the existing literature are highly positively skewed, having a unique modal
point close to zero and a very flat decay. For example, hyper-g and hyper-g/n
priors are T-shaped with modal point at 0. Again, if we consider popular
recommendations of $g$ in Zellner’s g-prior, choices include the unit information
prior ($g = n$, Kass and Raftery (1995)), the choice of $g$ related to the risk
inflation criterion ($g = p^2$, see Foster and George (1994), George and Foster
(2000)), and the benchmark prior ($g = \max\{n, p^2\}$, Fernandez, Ley and
Steel (2001)). Recently, Mukhopadhyay, Samanta and Chakrabarti (2014)
presented some theoretical results to explain why a relatively larger value
of $g$ yields better results, especially when $p$ grows with $n$, and recommended
using $g = n^2$ for practical purposes. From such recommendations, it seems
reasonable for a mixture to put relatively higher probability mass on higher values of $g$.
Thus, there is a gap between the range of $g$ favored when a fixed $g$ is used
and the range that receives most of the mass under the existing mixtures.
We now propose a mixture which gives more probability mass to a range of
relatively higher values of $g$ compared to the existing mixtures. We consider
the scaled inverse chi-square mixture $\pi(g)$ on $g$ with scale parameter $\tau^2 = n^2$
and degrees of freedom $\nu$, given by


$$ \pi(g) = \frac{(\tau^2\nu/2)^{\nu/2}}{\Gamma(\nu/2)} \, \frac{\exp[-\tau^2\nu/(2g)]}{g^{1+\nu/2}}, \qquad g > 0, \ \nu > 0, \ \tau^2 > 0. \tag{2.1} $$


Although the hyperparameter $\nu$ can take any positive value, we recommend
using values between $1$ and $p$. In this paper, we will consider two extreme
choices of $\nu$, namely, $\nu = 1$ and $\nu = p$. Note that such choices of hyperparameters
ensure that the prior has a unique mode at $n^2\nu/(\nu + 2)$ and a very
flat decay.

The use of the inverse gamma distribution in a scale mixture of normal priors
for $\beta_\alpha$ is a common practice in Bayesian model selection. It has already been
stated that the Zellner-Siow prior for $\beta_\alpha$ is an inverse gamma scale mixture of
normals with shape parameter $1/2$ and scale parameter $n/2$. In the context
of linear regression models with shrinkage priors, Park and Casella (2008)
and Hans (2009), while introducing the Bayesian version of the lasso, used inverse
gamma priors for similar normal scale mixtures for $\beta_\alpha$. The proposed prior
in (2.1) is the same as the inverse gamma prior with shape parameter $\nu/2$ and
scale parameter $n^2\nu/2$.
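
As a quick numerical check of the two statements above, the following minimal Python sketch evaluates the log density of the scaled inverse chi-square prior, compares it with an inverse gamma log density with shape $\nu/2$ and scale $n^2\nu/2$, and computes the mode $n^2\nu/(\nu+2)$. The function names and the sizes $n = 100$, $p = 10$ are illustrative assumptions made here, not values from the paper.

```python
import math

def log_scaled_inv_chi2(g, nu, tau2):
    """Log density of the scaled inverse chi-square law (d.f. nu, scale tau2)."""
    half = nu / 2.0
    return (half * math.log(tau2 * half) - math.lgamma(half)
            - tau2 * half / g - (1.0 + half) * math.log(g))

def log_inv_gamma(g, shape, scale):
    """Log density of the inverse gamma law with the given shape and scale."""
    return (shape * math.log(scale) - math.lgamma(shape)
            - (shape + 1.0) * math.log(g) - scale / g)

def prior_mode(nu, tau2):
    """Mode tau2 * nu / (nu + 2); with tau2 = n**2 this is n^2 nu / (nu + 2)."""
    return tau2 * nu / (nu + 2.0)

n, p = 100, 10                              # illustrative sizes only
for nu in (1, p):                           # the two extreme choices of nu
    tau2 = float(n ** 2)
    same = all(
        math.isclose(log_scaled_inv_chi2(g, nu, tau2),
                     log_inv_gamma(g, nu / 2.0, tau2 * nu / 2.0),
                     rel_tol=1e-12, abs_tol=1e-9)
        for g in (10.0, 1e3, 1e5)
    )
    print(nu, prior_mode(nu, tau2), same)
```

For both choices of $\nu$ the mode is of order $n^2$, which is the sense in which the prior places its mass on much larger values of $g$ than a prior with scale of order $n$.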

An advantage of considering this prior is that it yields a closed form approximation
to the marginal density, which is similar to the form of the
marginal for a g-prior with some fixed choice of $g$. Availability of closed form
marginals (posterior probabilities) is not necessary for good inference, but it
is a desirable property for easy implementation. This approximation
makes the application of this prior simple and theoretically tractable.

2.1. Posterior probability

For the linear model setup, the vector of parameters in model $M_\alpha$ is given
by $\theta_\alpha = (\beta_0, \beta_\alpha, \sigma^2, g)$. The marginal density $m_\alpha(y_n)$ in (1.4) is given by

$$ m_\alpha(y_n) = \int p(y_n | \beta_0, \beta_\alpha, \sigma^2, M_\alpha) \, p(\beta_\alpha | \beta_0, \sigma^2, g, M_\alpha) \, p(\beta_0, \sigma^2) \, \pi(g) \, d(\beta_0, \beta_\alpha, \sigma^2, g), $$

where $p(y_n | \beta_0, \beta_\alpha, \sigma^2, M_\alpha)$ is the p.d.f. of the $n$-variate normal distribution with
mean $\beta_0 1 + X_\alpha\beta_\alpha$ and dispersion matrix $\sigma^2 I$, and $p(\beta_\alpha | \beta_0, \sigma^2, g, M_\alpha)$, $p(\beta_0, \sigma^2)$,
$\pi(g)$ are as in equations (1.5) and (2.1).

Integrating the integrand above with respect to $\beta_0$, $\beta_\alpha$ and $\sigma^2$, we get a
closed form expression which leads to

$$ m_\alpha(y_n) = \frac{\Gamma((n-1)/2)}{\pi^{(n-1)/2}\sqrt{n}} \, (nS^2)^{-(n-1)/2} \int_0^\infty \frac{(1+g)^{(n-1-p(\alpha))/2}}{[1 + g(1 - R_\alpha^2)]^{(n-1)/2}} \, \pi(g) \, dg, \tag{2.2} $$

where $S^2 = \|y_n - \bar{y}_n 1\|^2/n$, $(1 - R_\alpha^2) = y_n'(I - P_n(\alpha))y_n/(nS^2)$,
$P_n(\alpha) = Z_{n\alpha}(Z_{n\alpha}' Z_{n\alpha})^{-1} Z_{n\alpha}'$, and $Z_{n\alpha} = (1 \;\; X_\alpha)$.

Note that the marginal density of the intercept-only model $M_N : y_n =
\beta_0 1 + e_n$, which will be referred to as the null model, does not involve the
hyperparameter $g$. It can be obtained as a special case of the marginal in
expression (2.2) by putting $R_\alpha^2 = 0$ and $p(\alpha) = 0$.

For models $\alpha \in \mathcal{A} \setminus \{N\}$, the marginal density given by the proposed prior
(2.1) does not have a closed form. However, we can approximate
the marginal density by a closed form expression when the proposed mixture
(2.1) is used in (2.2). When $p$ is fixed, this approximation attains an
accuracy of order $n^{-1}$ with probability tending to 1, which is the same as
the accuracy of the Laplace approximation for fixed $p$ (see Kass and Raftery
(1995)). When $p$ is increasing, the Laplace approximation may not be valid
for the integral in (2.2) for commonly used priors on $g$, since the integrand
may not be Laplace regular (see Kass, Tierney and Kadane (1990)). When
$p = O(n^b)$, $0 < b < 1$ and $\nu = 1$ (or, when $\nu$ is free of $n$), the approximation
is less accurate, with an error of the order $n^{-(1-b)}$. But, if $\nu = p$ (or, if $\nu$ is
of the same order in $n$ as $p$), the approximation still attains an accuracy of the
order $n^{-1}$.

We first state the assumptions under which the approximation holds.
Throughout this paper $y_n$ is modeled as in (1.1) and we assume the following:

(A.1) $\mu_n'\mu_n/n < M$ for some constant $M > 0$ as $n \to \infty$.

We make a mild assumption on the distribution of $e_n$.

(A.2) The errors $e_1, e_2, \ldots, e_n$ are i.i.d. with a common density having
mean 0 and finite fourth order moment.


We now state the result. 


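Result 2.1. Consider the set of priors (1.5) and (2.1) with $\nu$ varying from
$1$ to $p$. Under assumptions (A.1) and (A.2), the marginal density in (2.2)
satisfies the following:

$$ m_\alpha(y_n) \le \tilde{m}_\alpha(y_n) \left(1 + \frac{p}{\nu n} O(1)\right) $$

and

$$ m_\alpha(y_n) \ge \tilde{m}_\alpha(y_n) \left(1 + \frac{p}{\nu n} O_p(1)\right), $$

where

$$ \tilde{m}_\alpha(y_n) = \frac{\Gamma((n-1)/2) \, \Gamma((\nu + p(\alpha))/2)}{\pi^{(n-1)/2}\sqrt{n} \, \Gamma(\nu/2)} \left[nS^2(1 - R_\alpha^2)\right]^{-(n-1)/2} \left(\frac{2}{n^2\nu}\right)^{p(\alpha)/2}, $$

uniformly in $\alpha$, for any $\alpha \in \mathcal{A} \setminus \{N\}$ as $n \to \infty$.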

From Result 2.1, we find an approximation to the marginal density as

$$ m_\alpha(y_n) \approx \tilde{m}_\alpha(y_n), \tag{2.3} $$

in the sense that the ratio of $m_\alpha(y_n)$ and $\tilde{m}_\alpha(y_n)$ goes to 1 in probability.
This approximation holds uniformly in $\alpha$ since the $O(1)$ and $O_p(1)$ terms can
be made free of $\alpha$. It is easy to check from Result 2.1 that if $\nu = n^r$ for some
$0 < r < b$, then the approximation is accurate up to an order $1/n^{1+r-b}$. A
simulation study on the performance of this approximation is included in the
supplementary file.
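
The quality of the approximation can be explored numerically with the hedged Python sketch below. It rests on an assumption made here rather than a statement from the paper: that the closed form arises from replacing the integrand of (2.2) by its large-$g$ limit $g^{-p(\alpha)/2}(1 - R_\alpha^2)^{-(n-1)/2}$, whose integral against the inverse gamma prior with shape $\nu/2$ and scale $n^2\nu/2$ is available in closed form. The constant in front of the integral cancels in the ratio, so only the integral over $g$ is compared. All function names, the integration limits and the example values $n = 100$, $p(\alpha) = 2$, $R_\alpha^2 = 0.5$, $\nu = 1$ are illustrative.

```python
import math

def log_prior(g, nu, tau2):
    """Log density of the scaled inverse chi-square prior (d.f. nu, scale tau2)."""
    half = nu / 2.0
    return (half * math.log(tau2 * half) - math.lgamma(half)
            - tau2 * half / g - (1.0 + half) * math.log(g))

def log_g_integral(n, p_alpha, r2, nu, lo=1e-2, hi=1e14, steps=40000):
    """Log of the g-integral in (2.2), trapezoid rule in u = log g, tau2 = n**2."""
    c = 1.0 - r2
    u0, u1 = math.log(lo), math.log(hi)
    h = (u1 - u0) / steps
    logs = []
    for k in range(steps + 1):
        u = u0 + k * h
        g = math.exp(u)
        val = (0.5 * (n - 1 - p_alpha) * math.log1p(g)
               - 0.5 * (n - 1) * math.log1p(g * c)
               + log_prior(g, nu, float(n) ** 2)
               + u)                         # Jacobian of the change g = exp(u)
        if k in (0, steps):
            val += math.log(0.5)            # trapezoid end-point weight
        logs.append(val)
    top = max(logs)                         # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(v - top) for v in logs)) + math.log(h)

def log_g_integral_approx(n, p_alpha, r2, nu):
    """Closed form from the large-g limit of the integrand (assumption above)."""
    return (-0.5 * (n - 1) * math.log(1.0 - r2)
            + math.lgamma(0.5 * (nu + p_alpha)) - math.lgamma(0.5 * nu)
            + 0.5 * p_alpha * math.log(2.0 / (n ** 2 * nu)))

ratio = math.exp(log_g_integral(100, 2, 0.5, 1)
                 - log_g_integral_approx(100, 2, 0.5, 1))
print(ratio)                                # exact over approximate integral
```

In this example the ratio lies slightly below one, in line with a relative error of the order $p/(\nu n)$ when $p(\alpha)$ is small compared to $n$.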


3. Model selection consistency of scaled inverse chi-square 
mixture 

In this section, we assume that the true mean $\mu_n$ can be expressed as a linear
combination of a subset of the $p$ regressors. Let $M_{\alpha_c}$, $\alpha_c \in \mathcal{A}$, be the true
model. An ideal model selection procedure should identify the true model in
this framework. Therefore, a natural criterion in this case is model selection
consistency, which is achieved if the posterior probability of the true model, given
by (1.3), converges to one in probability, i.e.,

$$ p(M_{\alpha_c} | y_n) \xrightarrow{p} 1 \quad \text{as } n \to \infty. \tag{3.1} $$

In Section 3.1, we provide sufficient conditions under which (3.1) holds
when the error distribution in (1.1) is normal. In Section 3.2, we relax the
assumption of normality and assume that $e_n$ follows any distribution satisfying
assumption (A.2). In this wider class of distributions, we investigate how
the set of sufficient conditions for achieving (3.1) changes under the nested
model setup.

3.1. Model selection consistency with normality assumption 

Throughout this subsection, we will assume that $e_n \sim N_n(0, \sigma^2 I)$. We show
that when $p$ grows with $n$, then under appropriate conditions, model selection
consistency is achieved by the proposed prior considering all $2^p$ models in
the model space.

We split the model space into three mutually exclusive and exhaustive
parts as follows:

$\mathcal{A}_1 = \{\alpha \in \mathcal{A} : M_\alpha \supset M_{\alpha_c}, \alpha \ne \alpha_c\}$, $\mathcal{A}_2 = \{\alpha \in \mathcal{A} : \alpha \notin \mathcal{A}_1, \alpha \ne \alpha_c\}$ and
$\{\alpha_c\}$, where $M_{\alpha_c}$ is the true model. We assume that

(A.3) $\lim_{n \to \infty} n^s \min_{\alpha \in \mathcal{A}_2} \mu_n'(I - P_n(\alpha))\mu_n/n > \delta$ for some constants $\delta >
0$ and $0 \le s < 1$.

We impose a general restriction on the model prior probabilities as

(A.4) $\max_{\alpha, \alpha' \in \mathcal{A}} p(M_\alpha)/p(M_{\alpha'}) < C$ for some constant $C > 0$.

A remark on each of the assumptions is made below (see Remark 3.2 and
Remark 3.3).

Theorem 3.1. Let $y_n$ be as in (1.1) with $\mu_n$ satisfying (A.1) and $e_n \sim
N_n(0, \sigma^2 I)$. If $p = O(n^b)$, then the prior specification given by (1.5) and (2.1)
is model selection consistent for $0 < b < 2/5$ when $\nu = 1$, and $0 < b < 1/2$
when $\nu = p$, provided (A.3) holds with $s < (1 - b)/2$ and (A.4) holds.
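
The rate conditions of Theorem 3.1 are easy to misread, so the small Python helper below encodes them as printed. It is only a bookkeeping sketch: the function name and the labels `"one"` and `"p"` for the two choices of $\nu$ are hypothetical, and assumptions (A.1), (A.3) itself and (A.4) are not checked. The value $s = 0$ is admitted, following Remark 3.2.

```python
def theorem_3_1_rates_ok(b, s, nu_choice):
    """Check the exponent conditions of Theorem 3.1 as stated in the text.

    b         : growth exponent in p = O(n^b)
    s         : exponent appearing in assumption (A.3)
    nu_choice : "one" for nu = 1, "p" for nu = p (labels chosen here)
    """
    upper = {"one": 2.0 / 5.0, "p": 1.0 / 2.0}[nu_choice]
    return 0.0 < b < upper and 0.0 <= s < (1.0 - b) / 2.0

print(theorem_3_1_rates_ok(0.45, 0.2, "p"))     # inside the range for nu = p
print(theorem_3_1_rates_ok(0.45, 0.2, "one"))   # outside the range for nu = 1
```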

Remark 3.1. This result is different from those obtained by Wang and Sun
(2014), who have shown Bayes factor consistency or pairwise consistency for
a growing number of regressors. Our result deals with the asymptotic behavior
of the posterior probability of the true model considering all the $2^p$ models
in the model space, and it is indeed a much stronger result than pairwise
consistency. See in this context Johnson and Rossell (2012, p.652).

Remark 3.2. Assumption (A.3) for s = 0 was assumed by many authors 
(see, e.g., Fernandez, Ley and Steel (2001), Liang et al. (2008), Bayarri 
et al. (2012)). It is a key assumption for model selection consistency which 
ensures that the models can be differentiated. Here we relax this assumption 
by allowing s > 0, which is a natural extension for the situation when p grows 
with n. 

Remark 3.3. Assumption (A.4) may seem a bit restrictive when $p$ grows with
$n$. For the results in this paper to hold we do not actually need assumption
(A.4); we can use much weaker versions of the assumption. For each of these
results we mention the weaker version of the assumption needed for the proof.
On the whole, we work with assumption (A.4) for simplicity of presentation.

For Theorem 3.1 to hold, we only need $\max_{\alpha \in \mathcal{A}} p(M_\alpha)/p(M_{\alpha_c}) < C$ for
some constant $C > 0$. This is a reasonable assumption, since it only requires
that the prior probability of the true model not be arbitrarily close to
zero, which is necessary to achieve consistency.

3.2. Model selection consistency in general settings 

In this subsection, we extend our results to situations where the error distributions
belong to a larger class satisfying assumption (A.2). We investigate
the strength of the model selection algorithm when the distribution of the
errors is non-normal and the same model selection rule (based on the normal
likelihood) is used. In other words, we investigate the robustness of our model
selection rule to non-normal errors. Unlike the case of normal errors, here
we do not consider all the $2^p$ models. Rather, we restrict our search to
a class of nested models. By nested models, we mean a set of models where
for every pair of models, say, $M_1$ and $M_2$, either $M_2 \subset M_1$ or $M_1 \subset M_2$. So,
the index set $\mathcal{A}^*$ of all nested models can be expressed as

$$ \mathcal{A}^* = \{\{\phi\}, \{1\}, \{1, 2\}, \ldots, \{1, 2, \ldots, p\}\}, \quad \text{with } \mathcal{A}^* \subset \mathcal{A}. $$

Note that $\mathcal{A}^*$ has $p + 1$ different models. When $p = O(n^b)$ with $0 < b < 1$, the
number of models in $\mathcal{A}^*$ also increases with $n$. While the cardinality of $\mathcal{A}$ is
exponential in $p$ (i.e., $2^p$), for $\mathcal{A}^*$ it is linear in $p$. On the one hand, generalization
to non-normal errors broadens the scope of the model selection algorithm; on
the other, restriction to the class of nested models reduces the model space.
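
The nested family is simple to construct explicitly. The short Python sketch below builds it for a given $p$, verifies the pairwise inclusion property, and prints its size next to the size of the full model space. The function names are illustrative, and the empty tuple stands for the null model.

```python
def nested_models(p):
    """Index sets (), (1,), (1, 2), ..., (1, ..., p) of the nested family."""
    return [tuple(range(1, k + 1)) for k in range(p + 1)]

def is_nested(models):
    """True if every pair of index sets is ordered by inclusion."""
    sets = [set(m) for m in models]
    return all(a <= b or b <= a for a in sets for b in sets)

for p in (5, 10, 20):
    family = nested_models(p)
    # p + 1 nested models versus 2**p models in the full space
    print(p, len(family), 2 ** p, is_nested(family))
```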

Comparison of the set of nested models is of great practical interest. The 
situation with a model space like A* may occur, for example, when we have 
information on relative importance of the regressors and the regressors can 
be ordered accordingly. Model selection in nested models has been widely 
studied in the Bayesian paradigm when the error distribution is normal (see, 
e.g., Dawid (1992), Moreno (1997), Cui and George (2008), Wang and Sun 
(2014)). Unlike these authors, who study Bayes factor consistency, we consider
model selection consistency restricted to the space of nested models
when $p$ is of the said order. We summarize our findings in the following
theorem.

Theorem 3.2. Let $y_n$ be as in (1.1) with $\mu_n$ satisfying (A.1) and $e_n$ satisfying
(A.2). If $p = O(n^b)$, then the prior specification given by (1.5) and (2.1)
with $\nu = 1$ or $\nu = p$ is model selection consistent in $\mathcal{A}^*$ for any $0 < b < 1$,
provided (A.3) holds with $s < (1 - b)/2$ and (A.4) holds.

Remark 3.4. Here also we do not need assumption (A.4) to hold strictly.
For Theorem 3.2 to hold we only need $\max_{\alpha \in \mathcal{A}^*} p(M_\alpha)/p(M_{\alpha_c}) < C\sqrt{n}$ for
some constant $C > 0$. This assumption is quite general and includes many
of the popular classes of model prior probabilities.

4. Properties of Some Existing Mixtures 

We now investigate the performance of some other mixtures from the perspective
of model selection consistency. The beta prime (beta of the second kind)
prior is a commonly used prior for $g$ (see Liang et al. (2008), Maruyama
and George (2011), Bayarri et al. (2012)). Therefore, it is worth investigating
the performance of this prior when $p$ grows with $n$. Let $g$ follow a beta prime
distribution with parameters $\gamma_0$ and $\gamma_1$; then

$$ \pi(g) = \frac{\Gamma(\gamma_0 + \gamma_1)}{\Gamma(\gamma_0)\Gamma(\gamma_1)} \, g^{\gamma_0 - 1} (1 + g)^{-(\gamma_0 + \gamma_1)}, \qquad 0 < g < \infty, \ \gamma_0 > 0, \ \gamma_1 > 0. \tag{4.1} $$

We show that when $p = n^b$, $0 < b < 1$, then for some inappropriate specification
of the hyperparameters $\gamma_0$ and $\gamma_1$, the model selection rule given by
the set of priors (1.5) and (4.1) becomes inconsistent.

Theorem 4.1. Consider the setup of Theorem 3.1 and let the true model be
the null model. If the number of regressors $p = n^b$, $0 < b < 1$, then the set
of priors given by (1.5) and (4.1) with $\gamma_1 > \epsilon$ for some $\epsilon > 0$ free of $n$ is
inconsistent if $\gamma_0/\gamma_1 = O(n^{2b})$.

Remark 4.1. In the hyper-g prior, $\gamma_0 = 1$ and $\gamma_1 = (a/2 - 1)$ for some $a > 2$
free of $n$. Hence, it follows from the above theorem that the hyper-g prior is
inconsistent for any $b > 0$. It is already shown in Liang et al. (2008) that the
hyper-g prior is not consistent under the null model even for fixed $p$. The hyper-g/n
prior remains consistent when $p$ is fixed, but it fails to be consistent if
$p = n^b$, for any $b > 0$ (the proof is in the supplementary file).
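
The correspondence between the beta prime family and the hyper-g prior can be checked numerically. The Python sketch below evaluates the beta prime log density with parameters $\gamma_0$ and $\gamma_1$ and compares it, for $\gamma_0 = 1$ and $\gamma_1 = a/2 - 1$, with the hyper-g density $\frac{a-2}{2}(1+g)^{-a/2}$ of Liang et al. (2008). The function names and the choice $a = 3$ are illustrative.

```python
import math

def log_beta_prime(g, gamma0, gamma1):
    """Log density of the beta prime law with parameters gamma0 and gamma1."""
    log_norm = (math.lgamma(gamma0 + gamma1)
                - math.lgamma(gamma0) - math.lgamma(gamma1))
    return (log_norm + (gamma0 - 1.0) * math.log(g)
            - (gamma0 + gamma1) * math.log1p(g))

def log_hyper_g(g, a):
    """Log density of the hyper-g prior, (a - 2)/2 * (1 + g)^(-a/2), a > 2."""
    return math.log((a - 2.0) / 2.0) - 0.5 * a * math.log1p(g)

a = 3.0                                     # illustrative value with a > 2
for g in (0.1, 1.0, 50.0):
    lhs = log_beta_prime(g, 1.0, a / 2.0 - 1.0)
    rhs = log_hyper_g(g, a)
    print(g, math.isclose(lhs, rhs, rel_tol=1e-12, abs_tol=1e-12))
```

Here $\gamma_0/\gamma_1$ is a constant free of $n$, which is the feature of the hyper-g prior used in the remark above.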

Remark 4.2. The generalized g-prior has $\gamma_0 = A + 1$ and $\gamma_1 = B + 1$, where the
authors recommend using $A = (n - p(\alpha) - 1)/2 - B$ and $B < 1/2$ for the case
when $p < n$. Hence, it is easy to check that the generalized g-prior is inconsistent
for these recommended settings if $b > 1/2$.

Remark 4.3. The robust prior can also be expressed as a truncated scaled
beta prime distribution as $(g + B)/(\rho_\alpha(n + B)) - 1 \sim \text{beta prime}(1, A)$, where
$A > 0$, $B > 0$, $\rho_\alpha > B/(B + n)$. The recommended choices of hyperparameters
are $A = 1/2$, $B = 1$ and $\rho_\alpha = 1/(1 + p(\alpha))$. It has also been recommended
that $\rho_\alpha$ should be free of $n$. This makes the choice of the parameter $\rho_\alpha$ difficult
when $p = n^b$, since in that case the choice of $\rho_\alpha$ involves $n$. We check with
two choices of $\rho_\alpha$: a constant $\rho_\alpha$ and $\rho_\alpha = 1/(1 + p(\alpha))$. It has been shown in
the supplementary file that when $p = n^b$, a necessary condition for consistency
of the robust prior under the null model is $b < 1/2$, for both choices of $\rho_\alpha$.

There are some other beta shrinkage priors available in the literature; a
list can be found in Ley and Steel (2012). Similar results can be
derived for them.

Remark 4.4. As we have already mentioned, the Zellner-Siow prior is also an
inverse gamma prior with scale $n$, whereas the proposed one has scale $n^2\nu/2$.
It can be shown that a sufficient condition for the Zellner-Siow prior to be consistent
under the null model is $b < 1/5$ (the proof is similar to Case II in the proof
of Theorem 3.1 in Section A.3). The increase of the scale from the order of $n$
to $n^2$ makes the prior more reasonable for the ‘large p large n regime’, which is
reflected in the increase of the upper bound on $b$ from $1/5$ to $1/2$.

However, a similar improvement is not expected from all the priors we mentioned
before. For example, if we change the scale of the hyper-g/n prior from $n$
to $n^r$ for any $r > 1$ free of $n$, it still remains inconsistent when the null model
is true and the number of regressors is $p = n^b$, for any $0 < b < 1$. The proof
is similar in idea to the proof of the result stated in Remark 4.1.

Remark 4.5. As mentioned in Remark 3.3, here also we do not need a
strong assumption like (A.4). For all the above results in this section to hold,
we only need to assume

(A.4*) for all $\alpha \in \mathcal{A}$, $p(M_\alpha)/p(M_N) \ge \delta^{p(\alpha)}$ for some constant $\delta > 0$.

In situations when $M_N$ is a candidate for the true model, one may like to
put an additional penalty on more complex models. Therefore, the prior probabilities
for the high-dimensional models may be quite small compared to those
of low-dimensional models. Assumption (A.4*) gives us the scope to consider
such sets of prior probabilities as well. Consider, for example, the case where the
prior inclusion probability of each regressor is $q$, which leads to the Bernoulli
prior probability $q^{p(\alpha)}(1 - q)^{p - p(\alpha)}$ for the model $M_\alpha$. The Bernoulli prior has
been used by many authors (see, e.g., George and McCulloch (1993)). It is easy
to see that this prior setting satisfies assumption (A.4*).
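
The claim about the Bernoulli prior can be verified directly: the ratio $p(M_\alpha)/p(M_N)$ equals $(q/(1-q))^{p(\alpha)}$, so the bound holds with any $\delta \le q/(1-q)$. The Python sketch below checks this on the log scale for every model size. The function names, the small numerical tolerance and the example values are choices made here for illustration.

```python
import math

def log_bernoulli_model_prior(p_alpha, p, q):
    """Log of q^{p(alpha)} (1 - q)^{p - p(alpha)}."""
    return p_alpha * math.log(q) + (p - p_alpha) * math.log1p(-q)

def satisfies_a4_star(p, q, delta, tol=1e-9):
    """Check p(M_alpha)/p(M_N) >= delta^{p(alpha)} for all model sizes 0..p.

    The prior depends on alpha only through p(alpha), so checking each size
    once covers every model in the model space.
    """
    log_null = log_bernoulli_model_prior(0, p, q)
    return all(
        log_bernoulli_model_prior(k, p, q) - log_null >= k * math.log(delta) - tol
        for k in range(p + 1)
    )

print(satisfies_a4_star(30, 0.2, 0.25))   # delta = q / (1 - q): holds with equality
print(satisfies_a4_star(30, 0.2, 0.5))    # delta too large: the bound fails
```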

Note that the conditions on b mentioned in Theorem 3.1 are sufficient to 
achieve consistency for the proposed prior. The condition in Remark 4.4 is 
also sufficient for the Zellner-Siow prior, whereas the above conditions on beta
shrinkage priors (i.e., hyper-g/n prior, generalized g-prior and robust prior) 
are only necessary to achieve consistency. The range of b sufficient to achieve 
consistency is essentially a subset of the range of b necessary to achieve 
consistency. Also, achieving consistency under the null model is only a part 
of achieving full posterior consistency. Sufficient conditions for full posterior 
consistency may even be stronger. 

5. Consistency of Scaled Inverse Chi-square Prior in ‘Model 
False’ Case 

In Sections 3 and 4, we have considered situations when the true mean $\mu_n$
in (1.1) belongs to the span of $\{1, x_1, \ldots, x_p\}$. We will now consider a more
general scenario where $\mu_n$ is any $n$-dimensional vector, i.e., the true model
does not necessarily belong to the model space $\mathcal{A}$. Several authors have studied
related problems of linear model selection under this framework (see,
e.g., Li (1987), Shao (1997), Chakrabarti and Ghosh (2006), Chakrabarti
and Samanta (2008), Mukhopadhyay, Samanta and Chakrabarti (2014)). Of
course, one cannot compute the posterior probability of the true model here,
and the usual notion of model selection consistency cannot be used in this
scenario. To validate a model selection rule we therefore adopt an alternative
notion of consistency suited for the model false case, as used in Mukhopadhyay,
Samanta and Chakrabarti (2014).

Here consistency of a model selection procedure refers to the property
of choosing the model which is closest to the unknown true model among
all candidate models in the model space $\mathcal{A}$ (in an asymptotic sense). We
consider the Kullback-Leibler divergence as the measure of distance between
two probability distributions. We define the distance $\Delta_n(\alpha)$ between the true
distribution (of $y_n$) and the model $M_\alpha$ as the minimum of the Kullback-Leibler
distance with respect to the underlying parameters $(\beta_0, \beta_\alpha)$ of the
distribution under $M_\alpha$. One would naturally like to choose a model $M_{\alpha^*}$
which is as close as possible to the true distribution (i.e., for which $\Delta_n(\alpha^*) =
\min_{\alpha \in \mathcal{A}} \Delta_n(\alpha)$). One could find the model $M_{\alpha^*}$ only if the true distribution
were known, which is not the case here. We show that our model selection
procedure chooses a model which is closest to the unknown true model in an
asymptotic sense.

We consider a general class of error distributions satisfying assumption
(A.2) and the whole set $\mathcal{A}$ of $2^p$ models for comparison. We make the following
assumption, which is analogous to assumption (A.3), replacing $\mathcal{A}_2$ in (A.3)
by $\mathcal{A}$:

(A.3*) $\lim_{n \to \infty} n^s \min_{\alpha \in \mathcal{A}} \mu_n'(I - P_n(\alpha))\mu_n/n > \delta$ for some constants
$\delta > 0$ and $0 \le s < 1$.

Note that in the model false case $\mathcal{A}_2 = \mathcal{A}$.

Let the true distribution of $y_n$ have a density function $f$. It can be 
easily verified that the Kullback-Leibler distance between the true 
distribution given by the density function $f$, and the distribution 
$N(1\beta_0 + X_\alpha\beta_\alpha, \sigma^2 I)$ under $M_\alpha$ equals 

$$\int f(y_n) \log f(y_n)\, dy_n + \frac{n}{2}\left(1 + \log \sigma^2\right) + \frac{1}{2\sigma^2}\,(\mu_n - 1\beta_0 - X_\alpha\beta_\alpha)'(\mu_n - 1\beta_0 - X_\alpha\beta_\alpha).$$

The distance $\Delta_n(\alpha)$ between the true model $f$ and the model 
$M_\alpha$ is obtained by minimizing the above with respect to 
$(\beta_0, \beta_\alpha)$, as follows: 

$$\Delta_n(\alpha) = \int f(y_n) \log f(y_n)\, dy_n + \frac{n}{2}\left(1 + \log \sigma^2\right) + D_n(\alpha), \tag{5.1}$$

where 

$$D_n(\alpha) = \frac{1}{2\sigma^2}\, \mu_n'(I - P_n(\alpha))\mu_n. \tag{5.2}$$

Note that the first two terms of $\Delta_n(\alpha)$ in (5.1) do not involve 
$\alpha$. Therefore, $\arg\min_\alpha \Delta_n(\alpha) = \arg\min_\alpha D_n(\alpha)$ 
and we show that 

$$\frac{D_n(\hat{\alpha})}{\min_{\alpha \in \mathcal{A}} D_n(\alpha)} \xrightarrow{p} 1 \quad \text{as } n \to \infty, \tag{5.3}$$

where $M_{\hat{\alpha}}$ is the model chosen by the model selection rule based 
on the proposed prior. 
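
As a small illustration of the distance $D_n(\alpha) = \mu_n'(I - P_n(\alpha))\mu_n/(2\sigma^2)$, 
the following Python sketch evaluates it for every subset of two made-up 
regressors when the true mean is quadratic and hence outside all candidate 
models. This is a toy example, not taken from the paper. It assumes that 
$P_n(\alpha)$ projects onto the span of the intercept and the columns of 
$X_\alpha$, which corresponds to minimizing over $(\beta_0, \beta_\alpha)$; 
the data and the helper names `residual_ss` and `d_n` are hypothetical. 

```python
from itertools import combinations


def residual_ss(mu, columns):
    """Squared length of mu after projecting out span(columns) (Gram-Schmidt)."""
    basis = []
    for col in columns:
        v = list(col)
        for q in basis:
            coef = sum(a * b for a, b in zip(v, q))
            v = [a - coef * b for a, b in zip(v, q)]
        norm = sum(a * a for a in v) ** 0.5
        if norm > 1e-12:  # skip columns that are linearly dependent
            basis.append([a / norm for a in v])
    r = list(mu)
    for q in basis:
        coef = sum(a * b for a, b in zip(r, q))
        r = [a - coef * b for a, b in zip(r, q)]
    return sum(a * a for a in r)


def d_n(mu, regressors, alpha, sigma2=1.0):
    """D_n(alpha) = mu'(I - P_n(alpha)) mu / (2 sigma^2); intercept always included
    (an assumption of this sketch)."""
    n = len(mu)
    cols = [[1.0] * n] + [regressors[j] for j in alpha]
    return residual_ss(mu, cols) / (2.0 * sigma2)


# Toy data (hypothetical): the true mean is quadratic in x1.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [1.0, 0.0, 1.0, 0.0, 1.0]
mu = [2.0 + 0.5 * a + 0.1 * a * a for a in x1]
regressors = [x1, x2]

models = [alpha for k in range(3) for alpha in combinations(range(2), k)]
distances = {alpha: d_n(mu, regressors, alpha) for alpha in models}
closest = min(distances, key=distances.get)
```

Here `closest` plays the role of $\alpha^*$: it can be computed only because 
the toy example fixes the true mean, which is unknown in practice. 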

Theorem 5.1. Let $y_n$ be as in (1.1) with $\mu_n$ being any real vector in 
$\mathbb{R}^n$ satisfying (A.1) and $e_n$ satisfying (A.2). Suppose the 
assumption (A.4) holds and (A.3*) holds with $s < (1 - b)/2$. If the number of 
regressors $p = O(n^b)$, $0 < b < 1$, then the set of priors (1.5) and (2.1) 
is consistent in the sense that (5.3) holds for any $0 < b < 1$ when 
$\nu = 1$ or $\nu = p$. 

In Theorem 5.1, it is only assumed that $y_n$ is the sum of two components, 
namely, the true mean $\mu_n$ and the random error $e_n$. Here, $\mu_n$ is 
allowed to be arbitrary and $e_n$ can follow any distribution satisfying 
assumption (A.2); even symmetry or continuity is not required. Thus, given the 
additive structure of $y_n$ as in (1.1), consistency is obtained in a much 
more general setting. 

Lastly, assumption (A.4) can be relaxed to a great extent here. For 
Theorem 5.1 to hold, we only need the following: 

(A.4*) $\max_{\alpha, \alpha' \in \mathcal{A}} p(M_\alpha)/p(M_{\alpha'}) < C n^r$ 
for some $C > 0$ and $r > 0$. 

This assumption is satisfied for a very large class of prior probabilities on 
the model space. 

6. Information Consistency 

The criterion of information consistency is considered by several authors (see, 
e.g., Jeffreys (1961), Berger and Pericchi (2001), Bayarri and García-Donato 
(2008), Liang et al. (2008), Bayarri et al. (2012)). While comparing the null 
model with any model $M_\alpha$, suppose that $\|\hat{\beta}_\alpha\|^2 \to \infty$ 
(or equivalently, the usual F statistic goes to $\infty$) with both $n$ and 
$p(\alpha)$ fixed, $\hat{\beta}_\alpha$ being the least squares estimator of 
$\beta_\alpha$. This is considered very strong evidence supporting the model 
$M_\alpha$, and it is expected that the Bayes factor for comparing model 
$M_\alpha$ to the null model would go to $\infty$. The property that the Bayes 
factor goes to $\infty$ whenever $\|\hat{\beta}_\alpha\|^2 \to \infty$ with 
fixed $n$ and $p(\alpha)$ is termed information consistency in Bayarri et al. 
(2012). However, this does not hold in the case of Zellner's g-prior. For 
mixtures of g-priors, Liang et al. (2008, Theorem 2) give a sufficient 
condition which ensures information consistency. The following result gives 
conditions under which the proposed mixture in (2.1) is information consistent. 

Result 6.1. Consider the set of prior probabilities (1.5). Then the mixture 
on $g$ given by (2.1) is information consistent if $n > p + 1$ when $\nu = 1$ 
and if $n > 2p$ when $\nu = p$. 

The proof of this result is in the supplementary file. 

Note that for $\nu = 1$ the proposed prior is information consistent with 
minimal sample size, i.e., information consistency holds for any $n > p$ (see 
Liang et al. (2008) in this context). But for $\nu = p$, the proposed prior 
fails to be information consistent with minimal sample size. 
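
The thresholds in Result 6.1 can be motivated by a short tail calculation. 
The following is only a heuristic sketch and not the argument of the 
supplementary file. It assumes, as in the proofs in Appendix A, that the Bayes 
factor of $M_\alpha$ against the null model is 
$\int_0^\infty (1+g)^{(n-p(\alpha)-1)/2}\{1+(1-R_\alpha^2)g\}^{-(n-1)/2}\pi(g)\,dg$ 
and that $\pi(g) \propto g^{-(1+\nu/2)} e^{-\tau^2\nu/(2g)}$. As 
$R_\alpha^2 \to 1$ with $n$ and $p(\alpha)$ fixed, the integrand increases to 
$(1+g)^{(n-p(\alpha)-1)/2}\pi(g)$, and for large $g$ 

$$(1+g)^{(n-p(\alpha)-1)/2}\,\pi(g) \asymp g^{(n-p(\alpha)-1-\nu)/2 - 1},$$

so the limiting integral is infinite exactly when 
$n - p(\alpha) - 1 - \nu \ge 0$. In the worst case $p(\alpha) = p$ this reads 

$$\nu = 1:\ n \ge p + 2 \ \ (\text{i.e. } n > p + 1), \qquad \nu = p:\ n \ge 2p + 1 \ \ (\text{i.e. } n > 2p),$$

which agrees with the two conditions stated in Result 6.1. 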

7. Performance of The Proposed Prior on Simulated Datasets 

In this section we validate the performance of the proposed prior using 
simulated datasets. We present simulation results for model selection 
consistency under different simulation schemes. In each case, we consider our 
proposed prior with two choices of the hyperparameter $\nu$, viz., $\nu = 1$ 
(proposed prior I) and $\nu = p$ (proposed prior II). Along with the proposed 
prior we also consider four other priors on $g$, namely, the Zellner-Siow 
prior, the hyper-g/n prior, the generalized g-prior and the robust prior. 

Our results are designed for the case when p increases with n, and therefore, 
we consider moderately large p compared to n. Three choices of n (n = 
50,100,150) and two choices of p (p + 1 = 30, 50) for each n have been 
considered. 

The theoretical results are not confined to the case with normal errors; 
any error distribution satisfying assumption (A.2) can be considered. We 
consider three different error distributions, namely, normal, Laplace and t 
with 3 degrees of freedom (t(3)). Note that t(3) does not satisfy the fourth 
order moment condition of (A.2). Moreover, we consider all the $2^p$ models in 
the case of the Laplace and t(3) distributions also, although our theoretical 
results for the general case allow only the nested model setup. We consider 
these settings to check the performance of the proposed mixture when some of 
the assumptions of Theorem 3.2 do not hold. The simulation scheme is described 
as follows. 

For each combination of $(n, p)$, we generate $n$ values of each of the $p$ 
regressors $x_1, x_2, \ldots, x_p$ and this gives the full design matrix $X_n$. 
We choose $p$ numbers $\xi_i$, $i = 1, \ldots, p$, and generate the $n$ values 
of the $i$-th regressor $x_i$ from an $N(\xi_i, 1)$ distribution, 
$i = 1, \ldots, p$. We assume that the $n$ values of the $i$-th regressor are 
coming from a homogeneous population. In order to fix a "true" model, we 
choose its dimension $p(\alpha_c)$ and then choose the $p(\alpha_c)$ non-zero 
regression coefficients $\beta_i$'s, the intercept $\beta_0$ in the true model 
and also a value for the error variance $\sigma^2$. The $p(\alpha_c)$ columns 
of the design matrix $X_{\alpha_c}$ for the true model are chosen at random 
from the $p$ columns of $X_n$. Here, $(\xi_1, \ldots, \xi_p)$ is chosen as a 
random permutation of $(0.2, 0.4, \ldots, 0.2 \times p)$. The dimension of the 
true model $p(\alpha_c)$ is chosen as $[p/2]$ and the $p(\alpha_c)$ non-zero 
regression coefficients $\beta_i$'s and the intercept $\beta_0$ in the true 
model are randomly chosen from the set 
$\{-0.2, 0.4, \ldots, (-1)^p \times 0.2 \times p\}$. Lastly, we choose 
$\sigma = 1$. 
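
The data-generating steps above can be sketched in plain Python as follows. 
This is a minimal illustration rather than the authors' code (which was run in 
R). Several details are assumptions of the sketch and are marked as such in 
the comments: $[p/2]$ is read as the integer part, the coefficients are drawn 
without replacement, and the Laplace and t(3) errors are rescaled to have 
variance $\sigma^2$. All function names are hypothetical. 

```python
import math
import random


def simulate_truth(n, p, rng):
    """Design matrix, regressor means and a 'true' model for one (n, p) pair."""
    xi = [0.2 * (i + 1) for i in range(p)]
    rng.shuffle(xi)  # random permutation of (0.2, 0.4, ..., 0.2 p)
    X = [[rng.gauss(xi[j], 1.0) for j in range(p)] for _ in range(n)]
    k = p // 2  # p(alpha_c) = [p/2], read here as the integer part (assumption)
    active = sorted(rng.sample(range(p), k))  # columns of the true model
    pool = [(-1) ** i * 0.2 * i for i in range(1, p + 1)]
    draws = rng.sample(pool, k + 1)  # assumption: drawn without replacement
    beta0 = draws[0]
    beta = dict(zip(active, draws[1:]))
    return X, xi, active, beta0, beta


def draw_error(kind, rng, sigma=1.0):
    """One error with mean 0 and variance sigma^2 (variance scaling is an assumption)."""
    if kind == "normal":
        return rng.gauss(0.0, sigma)
    if kind == "laplace":
        b = sigma / math.sqrt(2.0)  # Laplace with scale b has variance 2 b^2
        # The difference of two i.i.d. exponentials with mean b is Laplace(0, b).
        return rng.expovariate(1.0 / b) - rng.expovariate(1.0 / b)
    if kind == "t3":
        z = rng.gauss(0.0, 1.0)
        chi2 = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(3))
        t = z / math.sqrt(chi2 / 3.0)  # Student t with 3 degrees of freedom
        return sigma * t / math.sqrt(3.0)  # t(3) has variance 3
    raise ValueError("unknown error distribution: " + kind)


def simulate_response(X, active, beta0, beta, kind, rng, sigma=1.0):
    """y = 1 beta_0 + X_{alpha_c} beta_{alpha_c} + e."""
    return [
        beta0 + sum(row[j] * beta[j] for j in active) + draw_error(kind, rng, sigma)
        for row in X
    ]


rng = random.Random(2015)
X, xi, active, beta0, beta = simulate_truth(n=50, p=29, rng=rng)
y = simulate_response(X, active, beta0, beta, "t3", rng)
```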

After choosing the dimension $p(\alpha_c)$, the coefficients 
$(\beta_0, \beta_{\alpha_c})$, the error variance $\sigma^2$ of the true model 
and the design matrix $X_n$, we generate $e_n$ from normal, Laplace and t(3) 
distributions, each with location vector 0 and dispersion matrix $\sigma^2 I$. 
The vector of observations $y_n$ is obtained by adding 
$\mu_n = 1\beta_0 + X_{\alpha_c}\beta_{\alpha_c}$ to $e_n$. Having obtained 
the data, we compute the posterior probability of the true model using the set 
of priors (1.5) for several mixtures on $g$ as indicated above. There are two 
issues to be mentioned here. Firstly, for calculation of the marginal, one 
needs to calculate the integral in (2.2), which is not of closed form for all 
the mixtures. We use numerical integration (available in R software) to 
calculate this integral for all the mixtures. Secondly, since $p$ is large, 
calculation of the posterior probability for the candidate models 
$\alpha \in \mathcal{A}$ becomes quite infeasible. This is because calculation 
of the posterior probability of any model requires the marginal densities 
$m_\alpha(y_n)$ for all the $2^p$ candidate models in $\mathcal{A}$. 
Therefore, we use Markov Chain Monte Carlo simulation techniques to 
approximate the posterior probabilities, where computation of marginal 
densities can be restricted only to the models visited by the chain. We have 
used the Gibbs sampling algorithm to simulate from the relevant Markov chain. 
The sampling scheme and the method of computation of posterior probabilities 
are completely described in Chipman, George and McCulloch (2001, Section 3.5). 
We have generated a Markov chain of length 10000 of which the first 5000 have 
been used as burn-in. A similar simulation is also used in Scheme 2 of 
Mukhopadhyay, Samanta and Chakrabarti (2014). 
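
The numerical integration step can be illustrated as follows. The authors use 
R; the sketch below is a plain-Python stand-in, not their code. It assumes 
that the mixing density in (2.1) is the scaled inverse chi-square density 
$\pi(g) = \{(\tau^2\nu/2)^{\nu/2}/\Gamma(\nu/2)\}\, g^{-(1+\nu/2)} e^{-\tau^2\nu/(2g)}$ 
that appears in the proofs of Appendix A, and that the integral of interest is 
$\int_0^\infty (1+g)^{(n-p(\alpha)-1)/2}\{1+(1-R_\alpha^2)g\}^{-(n-1)/2}\pi(g)\,dg$. 
The function names, the integration range and the grid size are arbitrary 
choices of this sketch. 

```python
import math


def log_prior(g, nu, tau):
    """Log density of the assumed scaled inverse chi-square prior on g."""
    a = 0.5 * nu
    s = 0.5 * tau * tau * nu
    return a * math.log(s) - math.lgamma(a) - (a + 1.0) * math.log(g) - s / g


def log_mixture_integral(n, p_alpha, r2, nu, tau, grid=4000):
    """Log of the integral of (1+g)^((n-p_alpha-1)/2) {1+(1-r2) g}^(-(n-1)/2) pi(g)
    over g, by the trapezoid rule in u = log(g), with all terms kept on the log scale."""
    lo, hi = math.log(1e-3), math.log(1e18)
    h = (hi - lo) / grid
    logs = []
    for i in range(grid + 1):
        u = lo + i * h
        g = math.exp(u)
        val = (
            0.5 * (n - p_alpha - 1) * math.log1p(g)
            - 0.5 * (n - 1) * math.log1p((1.0 - r2) * g)
            + log_prior(g, nu, tau)
            + u  # Jacobian of the substitution g = exp(u)
        )
        weight = 0.5 if i in (0, grid) else 1.0
        logs.append(val + math.log(weight * h))
    m = max(logs)
    return m + math.log(sum(math.exp(v - m) for v in logs))  # log-sum-exp


# With p_alpha = 0 and r2 = 0 the integrand reduces to pi(g), so the result
# should be close to log(1) = 0; this checks the quadrature.
check_nu_1 = log_mixture_integral(50, 0, 0.0, nu=1, tau=50.0)
check_nu_p = log_mixture_integral(50, 0, 0.0, nu=29, tau=50.0)
```

Working on the log scale avoids overflow when $n$ is moderately large, which a 
direct evaluation of the integrand would not. 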

For each combination of $(n, p)$ and each of the mixtures of $g$, we repeat 
the above 100 times, fixing the chosen values of the $\xi_i$'s, $p(\alpha_c)$, 
the $\beta_i$'s and $\sigma$. The mean and mean squared error of the posterior 
probabilities of the true model are presented in Table 1. In the table, the 
mean posterior probabilities are shown, with the mean squared errors (m.s.e.) 
in brackets. 
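
The two summaries reported in Table 1 can be computed as in the short sketch 
below. It assumes that the m.s.e. is measured around the ideal posterior 
probability 1; this is an inference from the table (rows with mean near 0 have 
m.s.e. near 1) and is not stated explicitly in the text. The function name is 
hypothetical. 

```python
def summarize_posterior(post_probs, target=1.0):
    """Mean and mean squared error (about `target`) of replicated posterior
    probabilities of the true model. target = 1.0 is an assumption of this sketch."""
    k = len(post_probs)
    mean = sum(post_probs) / k
    mse = sum((q - target) ** 2 for q in post_probs) / k
    return mean, mse


mean_a, mse_a = summarize_posterior([0.0] * 100)
mean_b, mse_b = summarize_posterior([1.0, 0.5])
```

Under this reading the m.s.e. equals $(1 - \text{mean})^2$ plus the variance 
of the replicated probabilities, so it penalizes both a low average and 
instability across replicates. 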

From Table 1, it is evident that the performance of the proposed mixture 
is better than that of the other mixtures. The differences in performance 
between the proposed priors and the others increase with $n$. Among the two 
choices of $\nu$, the choice $\nu = p$ performs better than $\nu = 1$. Among 
the other priors, the robust prior has the best performance. For example, when 
t(3) is considered with $n = 150$ and $p = 50$, for the proposed prior I 
(proposed prior II), the Markov chain visits the true model in around 37% 
(46%) cases, whereas for the robust prior it visits the true model only in 6% 
cases. A similar phenomenon is observed in other cases as well. 

Next, we consider two sparse situations. We first consider the case where 
the null model is true (Scheme 1). We take $p = 30$ and $\beta_0 = 5$ and 
assume $e_n$ to follow a normal distribution. The rest of the simulation 
scheme is as described above. The mean and m.s.e. of 100 replicates of 
posterior probabilities of different mixtures on $g$ are shown in Figure 1. 

Lastly, we consider the interesting situation where sparsity is present in the 
simulation scheme (Scheme 2) in the sense that a set of regression coefficients 
in the true model is negligible, even if not exactly zero. Here, it is desirable 
to select the parsimonious model (say $\alpha_s$) that includes all regressors 
with significant regression parameters, rather than the true model 
($\alpha_c$). We take $p = 30$, $p(\alpha_c) = 15$, $\sigma^2 = 1$. The error 
distribution considered is normal. The regression parameters of the true model 
are as follows: $\beta_i = i + 1$ for $i = 0, 1, \ldots, 4$, 
$0 < |\beta_i| < 0.008$ for $i = 5, 6, \ldots, 15$ and $\beta_i = 0$ for 
$i > 15$. For different $n$, we plot the mean and m.s.e. of posterior 
probabilities of the sparse model, $\alpha_s = \{1, 2, 3, 4\}$, in Figure 2. 

From both the figures, it can be seen that the performances of the proposed 
priors are distinctly better than those of the other priors. When the null 
model is true (see Figure 1), the posterior probabilities of all other priors 
are less than 0.00002 (therefore, the corresponding lines are not visually 
distinguishable in the figure), whereas the proposed prior for $\nu = p$ 
achieves an average posterior probability of 0.59, when $n = 150$. In Scheme 2 
(see Figure 2), the other priors perform relatively better than in Scheme 1. 
The highest average posterior probability among all the other priors is 
achieved by the generalized g-prior for $n = 150$ and is less than 0.075, 
whereas the average posterior probabilities achieved by the proposed priors 
for $n = 150$ are 0.358 for $\nu = 1$, and 0.599 for $\nu = p$. 

8. Concluding Remarks 

In this paper, we propose a mixture of g-priors suitable for the case when 
$p$ grows with $n$. The resulting marginal has an approximation with a closed 
form expression which makes implementation simple. We investigate the 
performance of the proposed prior by deriving consistency properties under 
different setups. We also compare its performance with that of several other 
mixtures using simulation results under different simulation schemes, which 
demonstrate its merits. Theoretically as well as in simulations, superiority 
of the performance of the proposed prior has also been shown under sparse 
situations. 

The prior for $\beta_\alpha$ arising from this mixture has a very thick tail, 
which is recommended by Jeffreys (1961). Further, this structure of priors 
(1.5) has properties like predictive matching and group invariance as 
described in Bayarri et al. (2012) (see Results 2-4 of Bayarri et al. (2012) in 
this context). The authors have explicitly justified the adoption of the form 
(1.5) in a broader context. 

Finally, it may be mentioned that we have studied the performance of the 
proposed mixture for $\nu = 1$ and $\nu = p$. The performance of the mixture 
with $\nu = p$ is better than the other in the light of all the properties 
considered in this paper except information consistency. The prior with 
$\nu = p$ fails to be information consistent when $n < 2p$. In practice, when 
$n > 2p$ one can conveniently use the prior with $\nu = p$. 

Appendix A 

In this section, we present the proofs of most of the main results stated in 
this paper. Many of the statements in the following proofs hold with 
probability tending to 1 as $n \to \infty$, although this will not always be 
mentioned. Throughout this section we will assume that 
$\mathrm{Var}(e_n) = \sigma^2 I$, where $\sigma^2 > 0$ is unknown. 

A.1. Auxiliary Results 

We first state three lemmas which will help in proving our main results. The 
proofs of these lemmas are given in the supplementary file. 

Table 1 

Average and mean squared error of posterior probability of the true model. 
Each cell shows mean (m.s.e.). A "?" marks a digit that is illegible in the 
source. 

Error    p+1  n    Zellner-Siow     Hyper-g/n        Generalized g    Robust           Proposed I       Proposed II
density
Normal   30   50   0.0948 (0.8294)  0.0949 (0.8293)  0.0956 (0.8282)  0.0981 (0.8242)  0.1251 (0.7836)  0.1598 (0.73?6)
              100  0.2916 (0.5095)  0.2919 (0.5091)  0.2926 (0.5081)  0.3056 (0.4903)  0.4598 (0.3131)  0.6163 (illegible)
              150  0.4932 (0.2954)  0.4962 (0.2922)  0.3758 (0.4192)  0.5358 (0.2515)  0.6834 (0.1393)  0.9055 (illegible)
         50   50   0.0000 (0.9999)  0.0000 (0.9999)  0.0000 (0.9999)  0.0000 (0.9999)  0.0006 (0.9999)  0.0008 (0.99??)
              100  0.0018 (0.9964)  0.0017 (0.9965)  0.0018 (0.9965)  0.0019 (0.9963)  0.0255 (0.9525)  0.0414 (0.92?5)
              150  0.0474 (0.9086)  0.0475 (0.9083)  0.0477 (0.2749)  0.0525 (0.8991)  0.2581 (0.4382)  0.4820 (0.3032)
Laplace  30   50   0.1611 (0.7156)  0.1603 (0.7167)  0.1614 (0.7151)  0.1669 (0.7067)  0.2155 (0.6387)  0.2767 (0.55?7)
              100  0.1937 (0.6620)  0.1933 (0.6626)  0.1944 (0.6608)  0.2028 (0.6484)  0.3374 (0.4764)  0.4369 (0.37?0)
              150  0.2573 (0.5640)  0.2575 (0.5636)  0.2578 (0.5632)  0.2734 (0.5419)  0.4822 (0.3073)  0.6144 (0.2??5)
         50   50   0.0011 (0.9978)  0.0011 (0.9978)  0.0011 (0.9978)  0.0011 (0.9978)  0.0013 (0.9974)  0.0068 (0.9866)
              100  0.0052 (0.9896)  0.0053 (0.9895)  0.0053 (0.9895)  0.0055 (0.9890)  0.0585 (0.8941)  0.0816 (0.85?4)
              150  0.0069 (0.9863)  0.0070 (0.9861)  0.0070 (0.9862)  0.0075 (0.9851)  0.1166 (0.8024)  0.1563 (0.74??)
t(3)     30   50   0.1059 (0.8103)  0.1060 (0.8103)  0.1065 (0.8094)  0.1091 (0.8053)  0.1673 (0.7252)  0.1921 (0.69?8)
              100  0.3191 (0.4700)  0.3197 (0.4693)  0.3204 (0.4683)  0.3342 (0.4497)  0.5003 (0.2674)  0.6593 (0.1332)
              150  0.5419 (0.2103)  0.5460 (0.2488)  0.4292 (0.3629)  0.5834 (0.2141)  0.7100 (0.1190)  0.8965 (0.02?4)
         50   50   0.0004 (0.9993)  0.0004 (0.9993)  0.0004 (0.9992)  0.0004 (0.9992)  0.0307 (0.9444)  0.0254 (0.9522)
              100  0.0080 (0.9840)  0.0081 (0.9840)  0.0080 (0.9840)  0.0084 (0.9834)  0.0713 (0.8765)  0.0911 (0.8468)
              150  0.0562 (0.8924)  0.0562 (0.8924)  0.0557 (0.8931)  0.0615 (0.8826)  0.3744 (0.4279)  0.4653 (0.3316)

Fig 1: Mean and M.S.E. of Posterior Probabilities of the True Model in Scheme 1 

Lemma A.1. If $y_n = \mu_n + e_n$ with $\mu_n$ satisfying assumption (A.1) and $e_n$ 
satisfying assumption (A.2), then the following results hold as $n \to \infty$: 

(i) $\bar{e} = \sum_{i=1}^n e_i / n \xrightarrow{p} 0$, 

(ii) $\mu_n' e_n / n \xrightarrow{p} 0$, 

(iii) $S^2 \xrightarrow{p} \sigma^2$ where $nS^2 = \sum_{i=1}^n (e_i - \bar{e})^2$, 

(iv) $\max_{\alpha \in \mathcal{A}} e_n' P_n(\alpha) e_n / n = O_p(p/n)$, 

(v) $\max_{\alpha \in \mathcal{A}} 2\,|\mu_n'(I - P_n(\alpha)) e_n| / n = O_p(\sqrt{p/n})$ and 

(vi) $e_n'(I - P_n(\alpha)) e_n / n \xrightarrow{p} \sigma^2$ uniformly in $\alpha \in \mathcal{A}$. 

Lemma A.2. Let $R_\alpha^2$ be as in (2.2). Under assumptions (A.1) and (A.2), 
$(1 - R_\alpha^2) > \sigma^2/(2M + 4\sigma^2)$ with probability tending to 1 
uniformly in $\alpha \in \mathcal{A}$, where $M$ is as in assumption (A.1). 

Lemma A.3. Under the setup of Theorem 3.2, for any fixed $R > 0$, with 
probability tending to one 

$$\max_{\alpha \in \mathcal{A}_1^*} \frac{e_n'(P_n(\alpha) - P_n(\alpha_c)) e_n}{\sigma^2 (p(\alpha) - p(\alpha_c))} < R \log p.$$

Fig 2: Mean and M.S.E. of Posterior Probabilities of the Sparse Model in Scheme 2 




A.2. Proof of Result 2.1 

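Using (2.1) and (2.2), we write 

$$m_\alpha(y_n) = C_{1,y,n}\, \mathcal{I}, \tag{A.1}$$

where 

$$C_{1,y,n} = \frac{\Gamma((n-1)/2)}{\Gamma(\nu/2)\, \pi^{(n-1)/2} \sqrt{n}} \left(\frac{\tau^2 \nu}{2}\right)^{\nu/2} \left(n S_y^2\right)^{-(n-1)/2} \tag{A.2}$$

and 

$$\mathcal{I} = \int_0^\infty e^{-\tau^2\nu/(2g)}\, g^{-(1+\nu/2)}\, (1+g)^{(n-p(\alpha)-1)/2}\, \{1 + (1 - R_\alpha^2) g\}^{-(n-1)/2}\, dg. \tag{A.3}$$

We first evaluate $\mathcal{I}$. After making the transformation 
$w = \tau^2\nu/(2g)$, we observe that 

$$\mathcal{I} = C_{2,y,n} \int_0^\infty e^{-w} w^{\nu/2 - 1} \left(1 + \frac{\tau^2\nu}{2w}\right)^{(n-p(\alpha)-1)/2} \left\{1 + (1 - R_\alpha^2)\frac{\tau^2\nu}{2w}\right\}^{-(n-1)/2} dw, \tag{A.4}$$

where $C_{2,y,n} = (\tau^2\nu/2)^{-\nu/2}$. Next we use the fact that, for any 
$w > 0$, 

$$\{1 + (1 - R_\alpha^2)\tau^2\nu/(2w)\}^{-(n-1)/2} \le \{(1 - R_\alpha^2)\tau^2\nu/(2w)\}^{-(n-1)/2}.$$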

Use of this inequality along with multiplication and division by 
$(\tau^2\nu/(2w))^{(n-p(\alpha)-1)/2}$ in the R.H.S. of (A.4) gives 

$$\mathcal{I} \le C_{3,y,n} \int_0^\infty e^{-w} w^{(p(\alpha)+\nu)/2 - 1} \left(1 + \frac{2w}{\tau^2\nu}\right)^{(n-p(\alpha)-1)/2} dw, \tag{A.5}$$

where $C_{3,y,n} = C_{2,y,n} (\tau^2\nu/2)^{-p(\alpha)/2} (1 - R_\alpha^2)^{-(n-1)/2}$. 
The quantity $(n - p(\alpha) - 1)/2$ can either be an integer or a 
half-integer. We first deal with the case when $(n - p(\alpha) - 1)/2$ is an 
integer. We expand the last term in (A.5) by the binomial theorem as follows: 

$$\left(1 + \frac{2w}{\tau^2\nu}\right)^{(n-p(\alpha)-1)/2} = 1 + \frac{(n - p(\alpha) - 1) w}{\tau^2\nu} + \cdots + \left(\frac{2w}{\tau^2\nu}\right)^{(n-p(\alpha)-1)/2}. \tag{A.6}$$


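For a fixed $n$ the sum in (A.6) is finite and hence the integration in (A.5) 
and the summation can be interchanged. It can be easily seen that after the 
interchange has taken place, each integral under the summation is a gamma 
integral of appropriate order. It then follows that 

$$\begin{aligned}
\mathcal{I} &\le C_{3,y,n} \left[ \Gamma\!\left(\frac{p(\alpha)+\nu}{2}\right) + \frac{n - p(\alpha) - 1}{\tau^2\nu}\, \Gamma\!\left(\frac{p(\alpha)+\nu}{2} + 1\right) + \cdots + \left(\frac{2}{\tau^2\nu}\right)^{(n-p(\alpha)-1)/2} \Gamma\!\left(\frac{n+\nu-1}{2}\right) \right] \\
&= C_{3,y,n}\, \Gamma\!\left(\frac{p(\alpha)+\nu}{2}\right) \left[ 1 + \frac{(n - p(\alpha) - 1)(p(\alpha)+\nu)}{2\tau^2\nu} + \cdots + \frac{(p(\alpha)+\nu)(p(\alpha)+\nu+2)\cdots(n+\nu-3)}{(\tau^2\nu)^{(n-p(\alpha)-1)/2}} \right].
\end{aligned}$$

For $\tau = n$, the above expression is 

$$\begin{aligned}
&\le C_{3,y,n}\, \Gamma\!\left(\frac{p(\alpha)+\nu}{2}\right) \left\{ 1 + \frac{n(p(\alpha)+\nu)}{2n^2\nu} + \frac{n^2 (p(\alpha)+\nu)(p(\alpha)+\nu+2)}{2!\,(2n^2\nu)^2} + \cdots \text{ up to the } ((n - p(\alpha) - 1)/2)\text{th term} \right\} \\
&\le C_{3,y,n}\, \Gamma\!\left(\frac{p(\alpha)+\nu}{2}\right) \left\{ 1 + \frac{p(\alpha)+\nu}{2n\nu} + \left(\frac{p(\alpha)+\nu}{2n\nu}\right)^2 + \cdots + \left(\frac{p(\alpha)+\nu}{2n\nu}\right)^{(n-p(\alpha)-1)/2} \right\},
\end{aligned}$$

using the fact that if $a > b > 0$ then $a/b > (a+1)/(b+1)$. The bracketed 
portion of the last expression is a G.P. series with positive terms. We add the 
terms with higher powers to make it an infinite G.P. series and also replace 
$p(\alpha)$ by $p$ (since $p(\alpha) \le p$) to make the series free of 
$\alpha$. The resultant bound is as follows: 

$$\le C_{3,y,n}\, \Gamma\!\left(\frac{p(\alpha)+\nu}{2}\right) \left\{ 1 + \frac{p+\nu}{2n\nu} + \left(\frac{p+\nu}{2n\nu}\right)^2 + \cdots \right\}. \tag{A.7}$$

From (A.7) it is clear that 

$$\mathcal{I} \le \begin{cases} C_{3,y,n}\, \Gamma\!\left(\frac{p(\alpha)+\nu}{2}\right) \left(1 + O\!\left(\frac{p}{n}\right)\right) & \text{if } \nu = 1 \text{ and } p = n^b, \\[4pt] C_{3,y,n}\, \Gamma\!\left(\frac{p(\alpha)+\nu}{2}\right) \left(1 + O\!\left(\frac{1}{n}\right)\right) & \text{if } \nu = p \text{ and } p = n^b. \end{cases} \tag{A.8}$$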

We now consider the case when $(n - p(\alpha) - 1)/2$ is not an integer. 
Certainly then $(n - p(\alpha))/2$ is an integer. For any $w > 0$, we bound 
the last term in (A.5) as 

$$(1 + (2w)/(\tau^2\nu))^{(n-p(\alpha)-1)/2} \le (1 + (2w)/(\tau^2\nu))^{(n-p(\alpha))/2}.$$

Using the above inequality, we proceed as before in (A.6), (A.7) and get the 
same bound as in (A.8). 

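Next, we bound $\mathcal{I}$ from the other direction and show that the 
difference between the two bounds is small. For this, we move back to (A.4) 
and use the inequality $(1 + \tau^2\nu/(2w)) > \tau^2\nu/(2w)$ along with a 
multiplication and division by the factor 
$((1 - R_\alpha^2)\tau^2\nu/(2w))^{(n-1)/2}$ in the integrand of (A.4). The 
resultant integral is as follows: 

$$\begin{aligned}
\mathcal{I} &\ge C_{3,y,n} \int_0^\infty w^{(\nu+p(\alpha))/2 - 1} e^{-w} \left\{ \frac{(1 - R_\alpha^2)\tau^2\nu/(2w)}{1 + (1 - R_\alpha^2)\tau^2\nu/(2w)} \right\}^{(n-1)/2} dw \\
&\ge C_{3,y,n} \int_0^\infty w^{(\nu+p(\alpha))/2 - 1} e^{-w} \left\{ 1 - \frac{2w}{(1 - R_\alpha^2)\tau^2\nu} \right\}^{(n-1)/2} dw,
\end{aligned}$$

where $C_{3,y,n}$ is the same as in (A.8). As before, here also we deal 
separately with the two cases when $(n-1)/2$ is an integer and when it is not. 
First consider the case when it is an integer. Expanding as before we get 

$$\begin{aligned}
\left\{ 1 - \frac{2w}{(1 - R_\alpha^2)\tau^2\nu} \right\}^{(n-1)/2}
&= 1 - \frac{(n-1) w}{\tau^2\nu (1 - R_\alpha^2)} + \frac{(n-1)(n-3) w^2}{2!\, \tau^4\nu^2 (1 - R_\alpha^2)^2} - \cdots + \left\{ \frac{-2w}{(1 - R_\alpha^2)\tau^2\nu} \right\}^{(n-1)/2} \\
&\ge 1 - \frac{w}{n\nu(1 - R_\alpha^2)} - \frac{1}{2!} \left\{ \frac{w}{(1 - R_\alpha^2) n\nu} \right\}^2 - \cdots - \frac{1}{((n-1)/2)!} \left\{ \frac{w}{(1 - R_\alpha^2) n\nu} \right\}^{(n-1)/2},
\end{aligned}$$

putting $\tau = n$ and replacing $(n-1)$, $(n-3)$, etc. by $n$. As before, we 
interchange the summation and integration and the resultant bound is as 
follows: 

$$\begin{aligned}
\mathcal{I} &\ge C_{3,y,n}\, \Gamma\!\left(\frac{p(\alpha)+\nu}{2}\right) \left\{ 1 - \frac{p(\alpha)+\nu}{2n\nu(1 - R_\alpha^2)} - \frac{(p(\alpha)+\nu)(p(\alpha)+\nu+2)}{2!\,\{2n\nu(1 - R_\alpha^2)\}^2} - \cdots \right\} \\
&\ge C_{3,y,n}\, \Gamma\!\left(\frac{p(\alpha)+\nu}{2}\right) \left\{ 1 - \frac{p(\alpha)+\nu}{2n\nu(1 - R_\alpha^2)} - \left( \frac{p(\alpha)+\nu}{2n\nu(1 - R_\alpha^2)} \right)^2 - \cdots - \left( \frac{p(\alpha)+\nu}{2n\nu(1 - R_\alpha^2)} \right)^{(n-1)/2} \right\} \\
&\ge C_{3,y,n}\, \Gamma\!\left(\frac{p(\alpha)+\nu}{2}\right) \left\{ 1 - \frac{p+\nu}{2n\nu(1 - R_\alpha^2)} - \left( \frac{p+\nu}{2n\nu(1 - R_\alpha^2)} \right)^2 - \cdots \right\}.
\end{aligned}$$

From Lemma A.2, we know that for all $\alpha \in \mathcal{A}$, 
$(1 - R_\alpha^2)$ has a fixed positive lower bound with probability tending 
to 1. Using the lemma we get the following result: 

$$\mathcal{I} \ge \begin{cases} C_{3,y,n}\, \Gamma\!\left(\frac{p(\alpha)+\nu}{2}\right) \left(1 + O_p\!\left(\frac{p}{n}\right)\right) & \text{if } \nu = 1 \text{ and } p = n^b, \\[4pt] C_{3,y,n}\, \Gamma\!\left(\frac{p(\alpha)+\nu}{2}\right) \left(1 + O_p\!\left(\frac{1}{n}\right)\right) & \text{if } \nu = p \text{ and } p = n^b. \end{cases} \tag{A.9}$$

Hence, from (A.1), (A.8) and (A.9), the result follows. □ 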


A.3. Proof of Theorem 3.1 

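We need to show 

$$\sum_{\alpha \in \mathcal{A}_i} \frac{p(M_\alpha)}{p(M_{\alpha_c})} \frac{m_\alpha(y_n)}{m_{\alpha_c}(y_n)} \xrightarrow{p} 0 \quad \text{for } i = 1, 2. \tag{A.10}$$

We prove this separately for the cases $M_{\alpha_c} \ne M_N$ and 
$M_{\alpha_c} = M_N$. 

Case I. $M_{\alpha_c} \ne M_N$. We first consider (A.10) for $i = 2$. 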

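From Result 2.1 we have 

$$\frac{m_\alpha(y_n)}{m_{\alpha_c}(y_n)} \le \left(\frac{n^2\nu}{2}\right)^{-(p(\alpha) - p(\alpha_c))/2} \left(\frac{1 - R_\alpha^2}{1 - R_{\alpha_c}^2}\right)^{-(n-1)/2} \frac{\Gamma\{(\nu + p(\alpha))/2\}}{\Gamma\{(\nu + p(\alpha_c))/2\}}\, \frac{\{1 + p\,O(1)/(\nu n)\}}{\{1 + p\,O_p(1)/(\nu n)\}}, \tag{A.11}$$

where the terms $O(1)$ and $O_p(1)$ are free of $\alpha$. We consider the terms 
of the R.H.S. of (A.11) one by one. First we consider the second term. We have, 
for $\alpha \in \mathcal{A}_2$, 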

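$$\begin{aligned}
\frac{1 - R_\alpha^2}{1 - R_{\alpha_c}^2} &\ge 1 + \frac{\mu_n'(I - P_n(\alpha))\mu_n + 2\mu_n'(I - P_n(\alpha)) e_n - e_n' P_n(\alpha) e_n}{e_n'(I - P_n(\alpha_c)) e_n} \\
&\ge 1 + \frac{1}{e_n' e_n / n} \left\{ \min_{\alpha \in \mathcal{A}_2} \frac{\mu_n'(I - P_n(\alpha))\mu_n}{n} - 2 \max_{\alpha \in \mathcal{A}_2} \frac{|\mu_n'(I - P_n(\alpha)) e_n|}{n} - \max_{\alpha \in \mathcal{A}_2} \frac{e_n' P_n(\alpha) e_n}{n} \right\}.
\end{aligned}$$

By assumption (A.3) and from parts (iv) and (v) of Lemma A.1, we have 

$$\frac{1}{n} \left\{ \min_{\alpha \in \mathcal{A}_2} \mu_n'(I - P_n(\alpha))\mu_n - \max_{\alpha \in \mathcal{A}_2} \left( 2\,|\mu_n'(I - P_n(\alpha)) e_n| + e_n' P_n(\alpha) e_n \right) \right\} > \frac{\delta_0}{n^s}$$

for some $\delta_0 > 0$, with probability tending to 1 provided 
$s < (1 - b)/2$. Since $e_n' e_n / n \xrightarrow{p} \sigma^2$, 

$$\max_{\alpha \in \mathcal{A}_2} \left( \frac{1 - R_{\alpha_c}^2}{1 - R_\alpha^2} \right)^{(n-1)/2} \le \left( 1 + \frac{\delta_1}{n^s} \right)^{-(n-1)/2} \tag{A.12}$$

for some $\delta_1 > 0$. 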

To evaluate the third term of the R.H.S. of (A.11), we make use of the result 
$\{x/(x+s)\}^{1-s} \le \Gamma(x+s)/\{x^s \Gamma(x)\} \le 1$ for $0 < s < 1$ and 
$x > 0$ from Wendel (1948). It can be shown that, 


Y{{v + p(a))/2} ^ fv + p's} p ^ p( “ c)l/2 
T{{u+p(a c ))/ 2 } - ) 


(A.13) 
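
Wendel's (1948) double inequality, in its standard form 
$\{x/(x+s)\}^{1-s} \le \Gamma(x+s)/\{x^s\Gamma(x)\} \le 1$ for $0 < s < 1$ and 
$x > 0$, is easy to check numerically. The following Python sketch does so on 
a small grid of values chosen only for illustration; the function name is 
hypothetical. 

```python
import math


def wendel_ratio(x, s):
    """Gamma(x + s) / (x**s * Gamma(x)), computed on the log scale for stability."""
    return math.exp(math.lgamma(x + s) - s * math.log(x) - math.lgamma(x))


checks = []
for x in (0.5, 2.0, 14.5, 100.0):
    for s in (0.1, 0.5, 0.9):
        lower = (x / (x + s)) ** (1.0 - s)
        ratio = wendel_ratio(x, s)
        checks.append(lower <= ratio <= 1.0)

all_hold = all(checks)
```

The ratio approaches 1 as $x$ grows, which is why gamma ratios such as the one 
in (A.11) behave like powers of their arguments. 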


Hence from (A.11), (A.12) and (A.13), 


aeA 2 m ac (y n ) 

fn 2 v\ ( p_p (ac))/2 / d 1 \~ ( ' rL ~ 1 ^ 2 fu + p\ p ^ 2 (l+j)0(l)/(ra)} 

“V 2 / V + ™V V 2 / {1 +P O p (l)/(vn)Y 

Using assumption (A.4) we also get an upper bound of the ratio of prior 
probabilities of the models. Therefore, 

y- p(M a ) m a (y n ) , f n 2 u \ (p ~ p(ctc))/2 / $y\ -("-b/ 2 / u + p \P 2 

a ^p(M ac )m ac (y n )- V 2 ) V n’J V 2 ) ' 

(A. 14) 


for some constant C'. It is easy to check that the above quantity goes to 0 
as n —> oo when s < (1 — b)/ 2. 

Next we prove (A.10) for $i = 1$. We recall (A.11) and consider each term 
of the R.H.S. We have, for any $\alpha \in \mathcal{A}_1$, 

$$\left( \frac{1 - R_\alpha^2}{1 - R_{\alpha_c}^2} \right)^{-(n-1)/2} = \left\{ \frac{e_n'(I - P_n(\alpha_c)) e_n}{e_n'(I - P_n(\alpha)) e_n} \right\}^{(n-1)/2} = \left\{ 1 - \frac{e_n'(P_n(\alpha) - P_n(\alpha_c)) e_n}{e_n'(I - P_n(\alpha_c)) e_n} \right\}^{-(n-1)/2}.$$

We now use Lemma 2 of Mukhopadhyay, Samanta and Chakrabarti (2014), which 
states that for any $R > 2$, with probability tending to 1, 

$$\max_{\alpha \in \mathcal{A}_1} \frac{e_n'(P_n(\alpha) - P_n(\alpha_c)) e_n}{\sigma^2 (p(\alpha) - p(\alpha_c))} < R \log p. \tag{A.15}$$

Again from part (vi) of Lemma A.1, $e_n'(I - P_n(\alpha_c)) e_n > n\sigma^2(1 - \epsilon)$ 
with probability tending to 1, for any $\epsilon > 0$. These facts imply that 
for any $R > 2$, with probability tending to 1 uniformly in 
$\alpha \in \mathcal{A}_1$, 

$$\begin{aligned}
\left( \frac{1 - R_\alpha^2}{1 - R_{\alpha_c}^2} \right)^{-(n-1)/2}
&\le \left\{ 1 - \frac{\sigma^2 (p(\alpha) - p(\alpha_c)) R \log p}{n\sigma^2 (1 - \epsilon)} \right\}^{-(n-1)/2} \\
&\le \exp\left\{ \frac{(p(\alpha) - p(\alpha_c)) R \log p}{2(1 - \epsilon)^2} \right\} \quad (\text{for any } 0 < z < \epsilon < 1,\ (1 - z) > \exp\{-z/(1 - \epsilon)\}) \\
&= p^{R(p(\alpha) - p(\alpha_c))/2(1 - \epsilon)^2}.
\end{aligned} \tag{A.16}$$

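Combining (A.11), (A.13), (A.16) and using assumption (A.4) we have 

$$\begin{aligned}
\sum_{\alpha \in \mathcal{A}_1} \frac{p(M_\alpha)}{p(M_{\alpha_c})} \frac{m_\alpha(y_n)}{m_{\alpha_c}(y_n)}
&\le C \sum_{\alpha \in \mathcal{A}_1} \left\{ \frac{(\nu + p)\, p^{R/(1-\epsilon)^2}}{n^2 \nu} \right\}^{(p(\alpha) - p(\alpha_c))/2} \\
&\le C \sum_{q=1}^{p - p(\alpha_c)} \binom{p - p(\alpha_c)}{q} \left\{ \frac{\sqrt{\nu + p}\; p^{R/2(1-\epsilon)^2}}{n\sqrt{\nu}} \right\}^{q} \\
&\le C \left\{ \left( 1 + \frac{\sqrt{\nu + p}\; p^{R/2(1-\epsilon)^2}}{n\sqrt{\nu}} \right)^{p - p(\alpha_c)} - 1 \right\}.
\end{aligned} \tag{A.17}$$

The above expression converges to 0 as $n \to \infty$, if the first term in the 
curly braces converges to 1. If $\nu = 1$ and $p = O(n^b)$ then the first term 
is less than $\left(1 + C\, n^{-[1 - (R+1)b/\{2(1-\epsilon)^2\}]}\right)^{k n^b}$ 
for some positive constants $C$ and $k$, any $R > 2$ and any $\epsilon > 0$. 
Also if $\nu = p$ then this term is less than 
$\left(1 + C'\, n^{-[1 - Rb/\{2(1-\epsilon)^2\}]}\right)^{k n^b}$ for some 
positive constants $C'$ and $k$, any $R > 2$ and any $\epsilon > 0$. Letting 
$R \downarrow 2$ and $\epsilon \downarrow 0$, we conclude that the last 
expression in (A.17) converges to 0 if $b < 2/5$ when $\nu = 1$ and if 
$b < 1/2$ when $\nu = p$. 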

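Case II. $M_{\alpha_c} = M_N$. When the null model is true, the Bayes factor of 
any model with respect to the null model is given by 

$$\frac{m_\alpha(y_n)}{m_N(y_n)} = \int_0^\infty (1+g)^{(n-p(\alpha)-1)/2}\, \{1 + g(1 - R_\alpha^2)\}^{-(n-1)/2}\, \pi(g)\, dg, \tag{A.18}$$

where $\pi(g)$ is as in (2.1). Now 

$$\begin{aligned}
\left[ \frac{1+g}{1 + (1 - R_\alpha^2) g} \right]^{(n-1)/2}
&= \exp\left\{ \frac{n-1}{2} \left[ \ln(1+g) - \ln(1 + (1 - R_\alpha^2) g) \right] \right\} \\
&= \exp\left\{ \frac{n-1}{2} \left[ \ln(1+g) - \ln(1+g) + \frac{R_\alpha^2\, g}{1 + g^*} \right] \right\}, \quad \text{where } g^* \in [(1 - R_\alpha^2) g,\, g], \\
&\le \exp\left\{ \frac{n-1}{2}\, \frac{R_\alpha^2}{(1 - R_\alpha^2) + 1/g} \right\} \\
&\le \exp\left\{ \frac{n-1}{2}\, \frac{R_\alpha^2}{1 - R_\alpha^2} \right\} \\
&= \exp\left\{ \frac{n-1}{2}\, \frac{e_n'(P_n(\alpha) - P_n(\alpha_c)) e_n}{e_n'(I - P_n(\alpha)) e_n} \right\},
\end{aligned}$$

since $nS^2 = e_n' e_n - n\bar{e}^2$ and $n\bar{e}^2 = e_n' P_n(\alpha_c) e_n$ 
when the null model is true. 

Next we use the facts that, with probability tending to 1, 
$e_n'(I - P_n(\alpha)) e_n > (n-1)\sigma^2(1 - \epsilon)$ for any 
$\epsilon > 0$ and, for any $R > 2$, 
$e_n' P_n(\alpha) e_n / \sigma^2 < R\, p(\alpha) \ln(p)$ uniformly in 
$\alpha \in \mathcal{A}$. Combining these, we have for any $\epsilon > 0$, 

$$\left[ \frac{1+g}{1 + (1 - R_\alpha^2) g} \right]^{(n-1)/2} \le p^{R p(\alpha)/2(1-\epsilon)}$$

with probability tending to 1. 
Hence from (A.18) we have 

$$\frac{m_\alpha(y_n)}{m_N(y_n)} \le p^{R p(\alpha)/2(1-\epsilon)} \int_0^\infty (1+g)^{-p(\alpha)/2}\, \pi(g)\, dg. \tag{A.19}$$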


Now with $\pi(g)$ as given in (2.1), 

$$I = \int_0^\infty (1+g)^{-p(\alpha)/2}\, \pi(g)\, dg \le \frac{(\tau^2\nu/2)^{\nu/2}}{\Gamma(\nu/2)} \int_0^\infty e^{-\tau^2\nu/(2g)}\, g^{-(p(\alpha)+\nu)/2 - 1}\, dg$$

by the fact that $(1+g)^{-1} < g^{-1}$. We then have 

$$I \le \begin{cases} (\tau^2/2)^{-p(\alpha)/2}\, \Gamma\{(p(\alpha)+1)/2\}/\Gamma(1/2) & \text{for } \nu = 1, \\ (\tau^2 p/2)^{-p(\alpha)/2}\, \Gamma\{(p(\alpha)+p)/2\}/\Gamma(p/2) & \text{for } \nu = p. \end{cases}$$

To evaluate these terms, we again use the result of Wendel (1948) stated 
above. After a little algebra one can show 

$$I \le \begin{cases} (\tau^2/2)^{-p(\alpha)/2}\, (p/2)^{p(\alpha)/2} & \text{for } \nu = 1, \\ (\tau^2 p/2)^{-p(\alpha)/2}\, p^{p(\alpha)/2} & \text{for } \nu = p. \end{cases}$$


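By assumption (A.4) and putting $\tau = n$, it follows from (A.19) that 

$$\begin{aligned}
\sum_{\alpha \in \mathcal{A} - \{\alpha_c\}} \frac{p(M_\alpha)}{p(M_{\alpha_c})} \frac{m_\alpha(y_n)}{m_{\alpha_c}(y_n)}
&\le \begin{cases} C \sum_{\alpha \in \mathcal{A} - \{\alpha_c\}} \left( p^{1 + R/(1-\epsilon)}/n^2 \right)^{p(\alpha)/2} / \sqrt{\pi} & \text{for } \nu = 1, \\ C \sum_{\alpha \in \mathcal{A} - \{\alpha_c\}} \left( 2 p^{R/(1-\epsilon)}/n^2 \right)^{p(\alpha)/2} & \text{for } \nu = p, \end{cases} \\
&\le \begin{cases} C \left\{ \left( 1 + p^{\{1 + R/(1-\epsilon)\}/2}/n \right)^p - 1 \right\} & \text{for } \nu = 1, \\ C \left\{ \left( 1 + \sqrt{2}\, p^{R/\{2(1-\epsilon)\}}/n \right)^p - 1 \right\} & \text{for } \nu = p, \end{cases}
\end{aligned} \tag{A.20}$$

for any $R > 2$ and any $\epsilon > 0$. 

As before, we let $R \downarrow 2$ and $\epsilon \downarrow 0$ and observe that 
the above quantity converges to 0 when $p$ is of order $n^b$ if $b < 2/5$ for 
$\nu = 1$ and $b < 1/2$ for $\nu = p$. □ 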


A.4. Proof of Theorem 3.2 

We proceed as in the proof of Theorem 3.1, and prove (A.10) with 
$\mathcal{A}_i$ replaced by its analog for nested models, $\mathcal{A}_i^*$. 
As before we consider separately the cases when the true model is null and 
when it is non-null. 

Case I. $M_{\alpha_c} \ne M_N$. First we consider $i = 2$. Recall the proof of 
Theorem 3.1. It can easily be seen that this part of Theorem 3.1 is proved for 
the model space $\mathcal{A}_2$ and without using the assumption of normality. 
Since $\mathcal{A}_2^*$ is a proper subset of $\mathcal{A}_2$, the same proof 
holds here also. 

Next consider (A.10) with $\mathcal{A}_1$ replaced by $\mathcal{A}_1^*$ and 
$i = 1$. Here we use Lemma A.3, which is equivalent to (A.15) when nested 
models are considered. This implies that (A.16) also holds here. Combining 
(A.11), (A.13) and (A.16) we have, for any $R > 0$ and any $\epsilon > 0$, 

$$\begin{aligned}
\sum_{\alpha \in \mathcal{A}_1^*} \frac{p(M_\alpha)}{p(M_{\alpha_c})} \frac{m_\alpha(y_n)}{m_{\alpha_c}(y_n)}
&\le C \sum_{\alpha \in \mathcal{A}_1^*} \left\{ \frac{(\nu + p)\, p^{R/(1-\epsilon)^2}}{n^2 \nu} \right\}^{(p(\alpha) - p(\alpha_c))/2} \\
&\le C \sum_{q=1}^{p - p(\alpha_c)} \left\{ \frac{\sqrt{\nu + p}\; p^{R/2(1-\epsilon)^2}}{n\sqrt{\nu}} \right\}^{q} \\
&\le C \left\{ \frac{\sqrt{\nu + p}\; p^{R/2(1-\epsilon)^2}}{n\sqrt{\nu}} \right\} \frac{1}{\left( 1 - \sqrt{\nu + p}\; p^{R/2(1-\epsilon)^2}/n\sqrt{\nu} \right)}.
\end{aligned} \tag{A.21}$$

For suitably chosen $R$ and $\epsilon$, it can be easily seen that (A.21) 
converges to 0 for any $0 < b < 1$. 

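Case II. $M_{\alpha_c} = M_N$. We proceed as in Theorem 3.1. Observe that using 
Lemma A.3 we obtain an inequality similar to (A.20) with $\mathcal{A}$ replaced 
by $\mathcal{A}^*$. Thus we have 

$$\begin{aligned}
\sum_{\alpha \in \mathcal{A}^* - \{\alpha_c\}} \frac{p(M_\alpha)}{p(M_{\alpha_c})} \frac{m_\alpha(y_n)}{m_{\alpha_c}(y_n)}
&\le \begin{cases} C \sum_{\alpha \in \mathcal{A}^* - \{\alpha_c\}} \left( p^{1 + R/(1-\epsilon)}/n^2 \right)^{p(\alpha)/2} / \sqrt{\pi} & \text{for } \nu = 1, \\ C \sum_{\alpha \in \mathcal{A}^* - \{\alpha_c\}} \left( 2 p^{R/(1-\epsilon)}/n^2 \right)^{p(\alpha)/2} & \text{for } \nu = p, \end{cases} \\
&\le \begin{cases} C \left\{ p^{1/2 + R/\{2(1-\epsilon)\}} / \left( n - p^{1/2 + R/\{2(1-\epsilon)\}} \right) \right\} & \text{for } \nu = 1, \\ C \left\{ \sqrt{2}\, p^{R/\{2(1-\epsilon)\}} / \left( n - \sqrt{2}\, p^{R/\{2(1-\epsilon)\}} \right) \right\} & \text{for } \nu = p. \end{cases}
\end{aligned}$$

One can choose $R > 0$ and $\epsilon > 0$ suitably and show that the above 
quantities go to 0 for any $0 < b < 1$. □ 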



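A.5. Proof of Theorem 4.1 


Let $M_{\alpha_c} = M_N$. From (A.18) and (A.4) we have 

$$\sum_{\alpha \in \mathcal{A} - \{\alpha_c\}} \frac{p(M_\alpha)}{p(M_{\alpha_c})} \frac{m_\alpha(y_n)}{m_{\alpha_c}(y_n)} \ge \frac{1}{C} \sum_{\alpha \in \mathcal{A} - \{\alpha_c\}} \int_0^\infty (1+g)^{-p(\alpha)/2}\, \pi(g)\, dg, \tag{A.22}$$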


where $\pi(g)$ is given by (4.1). Putting the prior we get the R.H.S. of the 
above expression as 

$$\frac{1}{C} \sum_{\alpha \in \mathcal{A} - \{\alpha_c\}} \frac{\Gamma(\gamma_0 + \gamma_1)\, \Gamma(\gamma_1 + p(\alpha)/2)}{\Gamma(\gamma_1)\, \Gamma(\gamma_0 + \gamma_1 + p(\alpha)/2)}.$$

Using the inequality of Wendel (1948) stated above and the fact that 
$\gamma_1 > \epsilon$ for some $\epsilon > 0$ free of $n$, it can be shown that 
for some constant $C > 0$, the above expression is bigger than 



Thus if $p = n^b$, the R.H.S. of (A.22) does not go to 0 if $\gamma_0/\gamma_1 = O(n^b)$. □ 


A.6. Proof of Theorem 5.1 


Our model selection criterion is to choose a model $\alpha$ in the model space 
$\mathcal{A}$ which maximizes $p(M_\alpha) m_\alpha(y_n)$ with respect to 
$\alpha$. Now, from Result 2.1, this is equivalent to maximizing 

$$p(M_\alpha)\, \Gamma\!\left(\frac{\nu + p(\alpha)}{2}\right) \left\{ n S_y^2 (1 - R_\alpha^2) \right\}^{-(n-1)/2} \left(\frac{n^2\nu}{2}\right)^{-p(\alpha)/2} (1 + \epsilon_n(\alpha)),$$

where $|\epsilon_n(\alpha)| = p\,O_p(1)/(n\nu)$ uniformly in $\alpha$. We omit 
the other terms involved in the approximation of Result 2.1, since those are 
free of $\alpha$. Maximizing the above is equivalent to minimizing 

$$\left\{ p(M_\alpha)\, \Gamma\!\left(\frac{\nu + p(\alpha)}{2}\right) (1 + \epsilon_n(\alpha)) \right\}^{-2/(n-1)} \left(\frac{n^2\nu}{2}\right)^{p(\alpha)/(n-1)} n S_y^2 (1 - R_\alpha^2) \tag{A.23}$$

with respect to $\alpha$. From (5.2) we have 
$n S_y^2 (1 - R_\alpha^2) = C_n + 2\sigma^2 D_n(\alpha)\{1 + \zeta_n(\alpha)\}$, 
where $C_n = e_n' e_n$ and 
$\zeta_n(\alpha) = \{2\mu_n'(I - P_n(\alpha)) e_n - e_n' P_n(\alpha) e_n\}/(2\sigma^2 D_n(\alpha))$. 
Hence, if $\hat{\alpha}$ is the model for which (A.23) is minimized, then 

$$\frac{D_n(\hat{\alpha})}{D_n(\alpha)} \le \frac{C_n (b_n(\alpha) - 1)}{2\sigma^2 D_n(\alpha)(1 + \zeta_n(\hat{\alpha}))} + \frac{b_n(\alpha)(1 + \zeta_n(\alpha))}{(1 + \zeta_n(\hat{\alpha}))},$$

where 


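$$b_n(\alpha) = \left( \frac{p(M_{\hat{\alpha}})}{p(M_\alpha)} \right)^{2/(n-1)} \left( \frac{\Gamma\{(\nu + p(\hat{\alpha}))/2\}}{\Gamma\{(\nu + p(\alpha))/2\}} \right)^{2/(n-1)} \left( \frac{n^2\nu}{2} \right)^{(p(\alpha) - p(\hat{\alpha}))/(n-1)} \left( \frac{1 + \epsilon_n(\hat{\alpha})}{1 + \epsilon_n(\alpha)} \right)^{2/(n-1)}. \tag{A.24}$$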


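Therefore, if $\zeta_n = \max_\alpha |\zeta_n(\alpha)|$, we have 

$$1 \le \frac{D_n(\hat{\alpha})}{\min_\alpha D_n(\alpha)} \le \frac{C_n}{2n\sigma^2 (1 - \zeta_n)} \times \max_\alpha \frac{n (b_n(\alpha) - 1)}{D_n(\alpha)} + \frac{(1 + \zeta_n)}{(1 - \zeta_n)} \times \max_\alpha b_n(\alpha). \tag{A.25}$$

The rest of the proof will follow from the following facts: 

$$C_n / n \xrightarrow{p} \sigma^2, \tag{A.26}$$

$$\zeta_n \xrightarrow{p} 0, \tag{A.27}$$

$$\max_\alpha \frac{n (b_n(\alpha) - 1)}{D_n(\alpha)} \xrightarrow{p} 0, \tag{A.28}$$

$$\text{and} \quad \max_\alpha b_n(\alpha) \xrightarrow{p} 1. \tag{A.29}$$
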
The proof of (A.26) is straightforward. To prove (A.27) we note that 

$$\zeta_n(\alpha) \le \frac{2 \max_\alpha \mu_n'(I - P_n(\alpha)) e_n / n - \min_\alpha e_n' P_n(\alpha) e_n / n}{2 \min_\alpha \sigma^2 D_n(\alpha)/n} \le \frac{O_p(\sqrt{p/n})}{\delta / n^s},$$

from (iv) and (v) of Lemma A.1 and assumption (A.3*). Clearly if 
$s < (1 - b)/2$, (A.27) holds. 

Next we prove (A.29). We show that 
$\log(\max_\alpha b_n(\alpha)) = \max_\alpha \log(b_n(\alpha)) \xrightarrow{p} 0$. 
From (A.24) we have 


max log b n (a) < 


loE LaxAAA + Ioe Lax r{(rc+P(°))/2n 

n-1 I 8 V » P(M a )J H « r{(n + p(a))/2}/ 


log 


1 + £ n (&) 


1 — max a \£ n (a)\ 


p(a ) — p(a) , 

+ max AAA—LAA i og 


2 

nA 


< 


n - 1 V 2 

2 f p fp + v\ (l+pO p {l)/{nv 

+2 log (—) +log ( i-Aw/U 


v 


n 


log 


2 

n z/ 
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form (A. 13) and assumption (A.4). It is now easy to show that when p = 
$O(n^b)$, $0 < b < 1$, the above expression is $O_p(n^{-(1-b)} \log(n))$ and 
converges to 0 with probability tending to 1. Hence (A.29) holds. 

Finally we prove (A.28). By the mean value theorem, for any $z > 0$, 
$(e^z - 1) = z e^{z^*} \le z e^z$, where $z^* \in [0, z]$. Replacing $z$ by 
$\log b_n(\alpha)$ we get 

$$\max_\alpha (b_n(\alpha) - 1) \le \max_\alpha \log b_n(\alpha)\, \exp\{\max_\alpha \log b_n(\alpha)\}.$$

Thus by assumption (A.3*) we have 

$$\max_\alpha \frac{n (b_n(\alpha) - 1)}{D_n(\alpha)} \le \frac{\max_\alpha \log b_n(\alpha)}{\min_\alpha D_n(\alpha)/n}\, \exp\left\{ O_p\!\left(n^{-(1-b)} \log(n)\right) \right\},$$

which goes to 0, with probability tending to 1. □ 